ML for Finance

Course 4: Machine Learning for Financial Markets

Course Overview

Duration: ~45 hours
Modules: 14 + Capstone
Exercises: ~95
Level: Intermediate to Advanced

What You'll Learn

  • Apply machine learning to predict market movements
  • Build robust feature engineering pipelines
  • Train and evaluate classification and regression models
  • Analyze sentiment from news and social media
  • Deploy production ML systems with monitoring
  • Avoid common pitfalls unique to financial ML

Course Structure

Part 1: ML Fundamentals for Finance

Build the foundation for financial machine learning.

Module Title Key Topics
1 ML Concepts for Traders Supervised/unsupervised, why finance is different
2 Data Preparation Time series splits, cross-validation, imbalanced data
3 Feature Engineering Price features, technical indicators, statistical features
4 Target Engineering Triple barrier method, meta-labeling, lookahead bias

Part 2: Classification Models

Master the models used for direction prediction.

Module Title Key Topics
5 Tree-Based Models Decision trees, Random Forest, XGBoost, LightGBM
6 Other Classification Models Logistic regression, SVM, neural networks
7 Model Evaluation Financial metrics, confusion matrix, ROC curves

Part 3: Advanced Techniques

Expand beyond classification into specialized domains.

Module Title Key Topics
8 Regression Models Return prediction, volatility forecasting, quantile regression
9 Sentiment Analysis Text processing, sentiment scoring, news signals
10 Alternative Data Web scraping, social media, multi-source features

Part 4: Deep Learning & Production

Deploy models to production with proper infrastructure.

Module Title Key Topics
11 Deep Learning for Finance Neural networks, LSTM, transformers
12 Backtesting ML Strategies Walk-forward optimization, avoiding pitfalls
13 Production ML Systems Model deployment, feature pipelines, monitoring
14 Advanced ML Topics Reinforcement learning, ensembles, online learning

Why Financial ML is Different

Machine learning in finance faces unique challenges that don't exist in other domains:

# Financial ML Challenges

challenges = {
    'Low Signal-to-Noise': {
        'description': 'Financial data is extremely noisy',
        'implication': 'Models easily overfit to noise instead of signal',
        'solution': 'Robust validation, regularization, feature selection'
    },
    'Non-Stationarity': {
        'description': 'Market dynamics change over time',
        'implication': 'Models trained on past data may not work on future data',
        'solution': 'Walk-forward validation, adaptive models, regime detection'
    },
    'Regime Changes': {
        'description': 'Markets shift between bull/bear/sideways regimes',
        'implication': 'A model that works in one regime may fail in another',
        'solution': 'Regime-aware models, ensemble approaches'
    },
    'Adversarial Environment': {
        'description': 'Other traders adapt to profitable strategies',
        'implication': 'Alpha decays as strategies become crowded',
        'solution': 'Continuous innovation, unique data sources'
    },
    'Lookahead Bias': {
        'description': 'Easy to accidentally use future information',
        'implication': 'Backtests look great but live trading fails',
        'solution': 'Strict point-in-time data, purging, embargo'
    }
}

for challenge, details in challenges.items():
    print(f"\n{challenge}")
    print(f"  Problem: {details['description']}")
    print(f"  Risk: {details['implication']}")
    print(f"  Solution: {details['solution']}")

The ML Pipeline for Trading

# The Financial ML Pipeline

pipeline_stages = """
┌─────────────────────────────────────────────────────────────────────────────┐
│                        FINANCIAL ML PIPELINE                                 │
└─────────────────────────────────────────────────────────────────────────────┘

1. DATA COLLECTION
   ├── Price data (OHLCV)
   ├── Fundamental data
   ├── Alternative data (news, social, satellite)
   └── Point-in-time considerations


2. DATA PREPARATION
   ├── Handle missing data
   ├── Adjust for corporate actions
   ├── Time series train/test split
   └── Avoid lookahead bias


3. FEATURE ENGINEERING
   ├── Price-based features (returns, volatility)
   ├── Technical indicators
   ├── Statistical features (z-scores, percentiles)
   └── Feature selection


4. TARGET ENGINEERING
   ├── Define prediction target
   ├── Triple barrier method
   ├── Meta-labeling
   └── Sample weighting


5. MODEL TRAINING
   ├── Select algorithm(s)
   ├── Hyperparameter tuning
   ├── Cross-validation (time series aware)
   └── Ensemble methods


6. EVALUATION
   ├── ML metrics (accuracy, F1, AUC)
   ├── Financial metrics (Sharpe, returns)
   ├── Walk-forward testing
   └── Statistical significance


7. DEPLOYMENT
   ├── Feature pipeline
   ├── Real-time prediction
   ├── Model monitoring
   └── Retraining triggers
"""

print(pipeline_stages)

Key Libraries

# Core libraries used throughout this course

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Financial data
import yfinance as yf

print("Core libraries imported successfully!")
print(f"\nVersions:")
print(f"  pandas: {pd.__version__}")
print(f"  numpy: {np.__version__}")

Quick Preview: A Simple ML Trading Model

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import accuracy_score, classification_report
import yfinance as yf
import warnings
warnings.filterwarnings('ignore')

# 1. Get data
print("1. Fetching data...")
ticker = yf.Ticker("SPY")
df = ticker.history(period="2y")
print(f"   Downloaded {len(df)} days of SPY data")

# 2. Create features
print("\n2. Engineering features...")
df['returns'] = df['Close'].pct_change()
df['sma_5'] = df['Close'].rolling(5).mean()
df['sma_20'] = df['Close'].rolling(20).mean()
df['volatility'] = df['returns'].rolling(20).std()
df['momentum'] = df['Close'].pct_change(10)

# Feature: distance from moving average
df['dist_sma_5'] = (df['Close'] - df['sma_5']) / df['sma_5']
df['dist_sma_20'] = (df['Close'] - df['sma_20']) / df['sma_20']

# 3. Create target (next day direction)
print("3. Creating target labels...")
df['target'] = (df['returns'].shift(-1) > 0).astype(int)

# 4. Prepare data
features = ['dist_sma_5', 'dist_sma_20', 'volatility', 'momentum']
df_clean = df.dropna()

X = df_clean[features]
y = df_clean['target']

# 5. Time series split (respects temporal order)
print("\n4. Splitting data (time series aware)...")
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

print(f"   Training: {len(X_train)} samples")
print(f"   Testing: {len(X_test)} samples")

# 6. Train model
print("\n5. Training Random Forest...")
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# 7. Evaluate
print("\n6. Evaluating model...")
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"\n   Accuracy: {accuracy:.2%}")
print(f"   (Random baseline: 50%)")

# Feature importance
print("\n7. Feature Importance:")
for feat, imp in sorted(zip(features, model.feature_importances_), key=lambda x: -x[1]):
    print(f"   {feat}: {imp:.3f}")

Prerequisites Check

Before starting this course, ensure you're comfortable with:

# Prerequisite skills check

prerequisites = {
    'Python Fundamentals': [
        'Variables and data types',
        'Functions and classes',
        'List comprehensions',
        'File I/O'
    ],
    'Pandas & NumPy': [
        'DataFrames and Series',
        'Indexing and selection',
        'GroupBy operations',
        'Vectorized operations'
    ],
    'Basic Statistics': [
        'Mean, variance, standard deviation',
        'Correlation and covariance',
        'Normal distribution',
        'Hypothesis testing basics'
    ],
    'Financial Concepts': [
        'Returns calculation',
        'Risk metrics (volatility)',
        'Technical indicators basics',
        'Market order types'
    ]
}

print("Prerequisites for this course:\n")
for category, skills in prerequisites.items():
    print(f"{category}:")
    for skill in skills:
        print(f"  - {skill}")
    print()

Capstone Preview

By the end of this course, you'll build a Production ML Trading System that includes:

  • Multi-source data pipeline (price + sentiment + alternative)
  • Feature engineering library with 50+ features
  • Multiple model comparison (tree-based + neural)
  • Proper walk-forward validation
  • Sentiment integration from news
  • Model interpretation & explainability (SHAP)
  • Production deployment with monitoring
  • Automated retraining pipeline

Let's Begin!

Start with Module 1: ML Concepts for Traders to understand why machine learning in finance requires special considerations.

Next: Module 1 - ML Concepts for Traders

Module 1: ML Concepts for Traders

Part 1: ML Fundamentals for Finance

Duration: ~2.5 hours
Exercises: 6

Learning Objectives

By the end of this module, you will be able to:

  • Distinguish between supervised and unsupervised learning
  • Understand why financial ML faces unique challenges
  • Map the complete ML pipeline for trading applications
  • Set up your ML development environment

1.1 What is Machine Learning?

Machine learning is about building systems that learn patterns from data rather than being explicitly programmed. In trading, we use ML to find patterns that might predict future price movements.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Traditional programming vs Machine Learning

print("Traditional Programming:")
print("  Input: Data + Rules")
print("  Output: Answers")
print("  Example: IF price > SMA(20) THEN buy")

print("\nMachine Learning:")
print("  Input: Data + Answers (historical examples)")
print("  Output: Rules (learned patterns)")
print("  Example: Model learns what conditions precede profitable trades")

Supervised vs Unsupervised Learning

# Supervised Learning: We have labeled examples
# - Classification: Predict categories (up/down, buy/sell/hold)
# - Regression: Predict continuous values (tomorrow's return)

supervised_examples = {
    'Classification': [
        'Predict if stock goes up or down tomorrow',
        'Classify trades as profitable or not',
        'Detect market regime (bull/bear/sideways)'
    ],
    'Regression': [
        'Predict tomorrow\'s return magnitude',
        'Forecast volatility',
        'Estimate fair value'
    ]
}

# Unsupervised Learning: No labels, find structure in data
unsupervised_examples = {
    'Clustering': [
        'Group similar stocks together',
        'Identify market regimes',
        'Segment trading days by behavior'
    ],
    'Dimensionality Reduction': [
        'Reduce many correlated features to few factors',
        'Find latent market factors',
        'Compress feature space'
    ]
}

print("SUPERVISED LEARNING (with labels):")
for category, examples in supervised_examples.items():
    print(f"\n  {category}:")
    for ex in examples:
        print(f"    - {ex}")

print("\n" + "="*50)
print("\nUNSUPERVISED LEARNING (no labels):")
for category, examples in unsupervised_examples.items():
    print(f"\n  {category}:")
    for ex in examples:
        print(f"    - {ex}")

Training, Validation, and Testing

# The fundamental concept: Split your data

print("Data Split Strategy:")
print("="*50)
print("\n1. TRAINING SET (~60-70%)")
print("   - Model learns patterns from this data")
print("   - Like studying for an exam")

print("\n2. VALIDATION SET (~15-20%)")
print("   - Used to tune hyperparameters")
print("   - Like practice tests")

print("\n3. TEST SET (~15-20%)")
print("   - Final evaluation, NEVER used during training")
print("   - Like the final exam")

# Visual representation
fig, ax = plt.subplots(figsize=(12, 2))

# Draw the splits
ax.barh(0, 0.7, left=0, color='steelblue', label='Training (70%)')
ax.barh(0, 0.15, left=0.7, color='orange', label='Validation (15%)')
ax.barh(0, 0.15, left=0.85, color='green', label='Test (15%)')

ax.set_xlim(0, 1)
ax.set_ylim(-0.5, 0.5)
ax.set_yticks([])
ax.set_xlabel('Data Timeline')
ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.5), ncol=3)
ax.set_title('Standard Data Split for Time Series')

plt.tight_layout()
plt.show()
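In practice, scikit-learn's TimeSeriesSplit (imported in the course setup) automates temporally ordered folds: each fold trains on an expanding window and tests on the block that immediately follows it. A minimal sketch on dummy data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 100 "days" of dummy observations, already in chronological order
X = np.arange(100).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
    # Every training index precedes every test index -- no shuffling
    print(f"Fold {fold}: train [{train_idx[0]}..{train_idx[-1]}], "
          f"test [{test_idx[0]}..{test_idx[-1]}]")
```

Unlike a shuffled KFold, no observation from the future ever leaks into a training window.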

1.2 Why Finance is Different

Financial markets present unique challenges that make ML much harder than in other domains.

# Challenge 1: Low Signal-to-Noise Ratio

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Simulate a signal buried in noise
days = 252
signal = np.sin(np.linspace(0, 4*np.pi, days)) * 0.01  # Tiny signal
noise = np.random.normal(0, 0.02, days)  # Much larger noise
observed_returns = signal + noise

# Calculate signal-to-noise ratio
snr = np.std(signal) / np.std(noise)
print(f"Signal-to-Noise Ratio: {snr:.2%}")
print("The true signal is only ~50% as strong as the noise!")

fig, axes = plt.subplots(3, 1, figsize=(12, 8))

axes[0].plot(signal, 'g-', linewidth=2)
axes[0].set_title('True Signal (Hidden)')
axes[0].set_ylabel('Return')

axes[1].plot(noise, 'r-', alpha=0.7)
axes[1].set_title('Noise')
axes[1].set_ylabel('Return')

axes[2].plot(observed_returns, 'b-', alpha=0.7)
axes[2].set_title('What We Observe (Signal + Noise)')
axes[2].set_xlabel('Days')
axes[2].set_ylabel('Return')

plt.tight_layout()
plt.show()

print("\nImplication: Models easily fit to noise, not signal")
print("Solution: Regularization, cross-validation, feature selection")
# Challenge 2: Non-Stationarity

# Financial time series properties change over time
np.random.seed(42)

# Simulate changing volatility regimes
low_vol = np.random.normal(0.001, 0.01, 100)
high_vol = np.random.normal(-0.002, 0.03, 100)
medium_vol = np.random.normal(0.0005, 0.015, 100)

returns = np.concatenate([low_vol, high_vol, medium_vol])

fig, axes = plt.subplots(2, 1, figsize=(12, 6))

# Returns
axes[0].plot(returns)
axes[0].axvline(100, color='r', linestyle='--', alpha=0.7)
axes[0].axvline(200, color='r', linestyle='--', alpha=0.7)
axes[0].set_title('Returns with Changing Regimes')
axes[0].set_ylabel('Return')

# Rolling volatility
rolling_vol = pd.Series(returns).rolling(20).std()
axes[1].plot(rolling_vol, color='orange')
axes[1].axvline(100, color='r', linestyle='--', alpha=0.7)
axes[1].axvline(200, color='r', linestyle='--', alpha=0.7)
axes[1].set_title('Rolling Volatility (20-day)')
axes[1].set_xlabel('Days')
axes[1].set_ylabel('Volatility')

plt.tight_layout()
plt.show()

print("The same strategy that works in one regime may fail in another.")
print("A model trained on low-vol data will struggle in high-vol periods.")
# Challenge 3: Adversarial Environment

print("The Adversarial Nature of Markets")
print("="*50)
print("""
Unlike image classification or language translation:

1. OTHER TRADERS ARE ADAPTING
   - If you find a profitable pattern, others will too
   - As more money exploits a pattern, the edge disappears
   - "Alpha decay" - strategies lose effectiveness over time

2. THE SYSTEM FIGHTS BACK
   - Market makers adjust to trading patterns
   - Large trades move prices against you
   - Information gets priced in faster

3. REGIME CHANGES ARE STRUCTURAL
   - Regulations change market behavior
   - New instruments (ETFs, derivatives) change dynamics
   - Technology (HFT) changes market microstructure

This is fundamentally different from other ML domains:
- Cats don't evolve to avoid being classified as cats
- Weather patterns don't adapt to forecasts
- Medical conditions don't change because you're predicting them
""")
# Challenge 4: Lookahead Bias - The Silent Killer

print("Lookahead Bias Examples")
print("="*50)

lookahead_examples = [
    {
        'mistake': 'Using adjusted close prices for historical signals',
        'problem': 'Historical prices are restated using splits/dividends that had not yet occurred',
        'solution': 'Use unadjusted prices, apply adjustments at signal time'
    },
    {
        'mistake': 'Including future data in feature calculation',
        'problem': 'Rolling window includes today\'s close in today\'s signal',
        'solution': 'Use .shift(1) to lag features appropriately'
    },
    {
        'mistake': 'Survivorship bias in stock universe',
        'problem': 'Only including stocks that exist today',
        'solution': 'Use point-in-time constituent lists'
    },
    {
        'mistake': 'Using final earnings numbers',
        'problem': 'Earnings are often revised after initial release',
        'solution': 'Use point-in-time fundamental data'
    }
]

for i, example in enumerate(lookahead_examples, 1):
    print(f"\n{i}. {example['mistake']}")
    print(f"   Problem: {example['problem']}")
    print(f"   Solution: {example['solution']}")

Exercise 1.1: Identify ML Problem Types (Guided)

Classify financial ML problems as supervised/unsupervised and classification/regression.

Solution 1.1
def classify_ml_problem(description: str) -> dict:
    """
    Classify an ML problem based on its description.

    Args:
        description: Description of the ML problem

    Returns:
        Dictionary with learning_type and task_type
    """
    description_lower = description.lower()

    # Determine if supervised or unsupervised
    supervised_keywords = ['predict', 'forecast', 'classify', 'whether', 'will']
    is_supervised = any(kw in description_lower for kw in supervised_keywords)

    # Determine task type
    classification_keywords = ['up', 'down', 'category', 'direction', 'whether']
    regression_keywords = ['how much', 'value', 'return', 'price', 'amount']

    is_classification = any(kw in description_lower for kw in classification_keywords)
    is_regression = any(kw in description_lower for kw in regression_keywords)

    # Build result
    result = {
        'learning_type': 'supervised' if is_supervised else 'unsupervised',
        'task_type': 'unknown'
    }

    if is_supervised:
        if is_classification and not is_regression:
            result['task_type'] = 'classification'
        elif is_regression and not is_classification:
            result['task_type'] = 'regression'
        else:
            result['task_type'] = 'could be either'
    else:
        result['task_type'] = 'clustering or dimensionality reduction'

    return result

1.3 The ML Pipeline

A systematic approach to building ML models for trading.

# The Complete ML Pipeline for Trading

class MLPipelineStage:
    """Represents a stage in the ML pipeline."""
    
    def __init__(self, name: str, description: str, key_considerations: list):
        self.name = name
        self.description = description
        self.key_considerations = key_considerations
    
    def display(self):
        print(f"\n{'='*60}")
        print(f"STAGE: {self.name}")
        print(f"{'='*60}")
        print(f"\n{self.description}")
        print("\nKey Considerations:")
        for consideration in self.key_considerations:
            print(f"  - {consideration}")


# Define pipeline stages
pipeline = [
    MLPipelineStage(
        "1. Data Collection",
        "Gather all relevant data sources for your trading strategy.",
        [
            "Price data (OHLCV) at appropriate frequency",
            "Fundamental data (earnings, ratios)",
            "Alternative data (sentiment, satellite)",
            "Ensure point-in-time accuracy"
        ]
    ),
    MLPipelineStage(
        "2. Data Preparation",
        "Clean and prepare data for ML consumption.",
        [
            "Handle missing values appropriately",
            "Adjust for corporate actions",
            "Time series train/test split (no random shuffle!)",
            "Check for lookahead bias"
        ]
    ),
    MLPipelineStage(
        "3. Feature Engineering",
        "Create predictive features from raw data.",
        [
            "Price-based features (returns, volatility)",
            "Technical indicators",
            "Statistical features (z-scores, percentiles)",
            "Feature selection to avoid overfitting"
        ]
    ),
    MLPipelineStage(
        "4. Target Engineering",
        "Define what you're trying to predict.",
        [
            "Return-based vs direction-based targets",
            "Triple barrier method for labeling",
            "Handle overlapping labels",
            "Sample weighting for uniqueness"
        ]
    ),
    MLPipelineStage(
        "5. Model Training",
        "Train and tune your ML model.",
        [
            "Select appropriate algorithm",
            "Hyperparameter tuning with CV",
            "Use time series cross-validation",
            "Ensemble methods for robustness"
        ]
    ),
    MLPipelineStage(
        "6. Evaluation",
        "Assess model performance rigorously.",
        [
            "ML metrics (accuracy, precision, recall)",
            "Financial metrics (Sharpe, returns)",
            "Walk-forward validation",
            "Statistical significance testing"
        ]
    ),
    MLPipelineStage(
        "7. Deployment",
        "Put the model into production.",
        [
            "Real-time feature calculation",
            "Model serving infrastructure",
            "Monitoring and alerting",
            "Retraining schedule"
        ]
    )
]

# Display all stages
for stage in pipeline:
    stage.display()
# Mini Example: Complete Pipeline Demo

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import yfinance as yf

print("Mini Pipeline Demo")
print("="*50)

# Stage 1: Data Collection
print("\n1. Collecting data...")
df = yf.Ticker("AAPL").history(period="2y")
print(f"   Downloaded {len(df)} rows")

# Stage 2: Data Preparation
print("\n2. Preparing data...")
df = df[['Open', 'High', 'Low', 'Close', 'Volume']].copy()
print(f"   Columns: {list(df.columns)}")

# Stage 3: Feature Engineering
print("\n3. Engineering features...")
df['returns'] = df['Close'].pct_change()
df['volatility'] = df['returns'].rolling(20).std()
df['momentum'] = df['Close'].pct_change(10)
df['volume_change'] = df['Volume'].pct_change()
print(f"   Created 4 features")

# Stage 4: Target Engineering
print("\n4. Creating target...")
df['target'] = (df['returns'].shift(-1) > 0).astype(int)
print(f"   Target: Next day direction (0=down, 1=up)")

# Prepare final dataset
features = ['returns', 'volatility', 'momentum', 'volume_change']
df_clean = df.dropna()

# Stage 5: Model Training (with time series split)
print("\n5. Training model...")
split_idx = int(len(df_clean) * 0.8)

X_train = df_clean[features][:split_idx]
y_train = df_clean['target'][:split_idx]
X_test = df_clean[features][split_idx:]
y_test = df_clean['target'][split_idx:]

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)
print(f"   Trained on {len(X_train)} samples")

# Stage 6: Evaluation
print("\n6. Evaluating model...")
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"   Test accuracy: {accuracy:.2%}")
print(f"   Baseline (random): 50%")

# Stage 7: Deployment (just a preview)
print("\n7. Ready for deployment!")
print(f"   Model can predict direction based on 4 features")

Exercise 1.2: Pipeline Stage Matcher (Guided)

Match activities to the correct pipeline stage.

Solution 1.2
def match_to_pipeline_stage(activity: str) -> str:
    """
    Match an activity to the correct ML pipeline stage.

    Args:
        activity: Description of the activity

    Returns:
        Name of the pipeline stage
    """
    activity_lower = activity.lower()

    # Define keywords for each stage
    stage_keywords = {
        'data_collection': ['download', 'fetch', 'api', 'source', 'gather'],
        'data_preparation': ['clean', 'missing', 'split', 'adjust', 'outlier'],
        'feature_engineering': ['indicator', 'feature', 'rolling', 'calculate', 'transform'],
        'target_engineering': ['label', 'target', 'predict what', 'direction', 'barrier'],
        'model_training': ['train', 'fit', 'hyperparameter', 'tune', 'algorithm'],
        'evaluation': ['accuracy', 'sharpe', 'test', 'metric', 'performance'],
        'deployment': ['production', 'real-time', 'monitor', 'serve', 'deploy']
    }

    # Find the matching stage
    for stage, keywords in stage_keywords.items():
        if any(kw in activity_lower for kw in keywords):
            return stage.replace('_', ' ').title()

    return 'Unknown Stage'

1.4 Tools Setup

Setting up your ML development environment for financial applications.

# Core ML Libraries

print("Essential Libraries for Financial ML")
print("="*50)

libraries = {
    'Data Manipulation': {
        'pandas': 'DataFrames for tabular data',
        'numpy': 'Numerical computing',
    },
    'Machine Learning': {
        'scikit-learn': 'Core ML algorithms and utilities',
        'xgboost': 'Gradient boosting (fast, accurate)',
        'lightgbm': 'Another gradient boosting option',
    },
    'Deep Learning': {
        'torch (PyTorch)': 'Neural networks',
        'tensorflow': 'Alternative to PyTorch',
    },
    'Visualization': {
        'matplotlib': 'Static plots',
        'seaborn': 'Statistical visualizations',
        'plotly': 'Interactive plots',
    },
    'Financial Data': {
        'yfinance': 'Yahoo Finance data',
        'pandas-datareader': 'Multiple data sources',
    },
    'NLP & Sentiment': {
        'nltk': 'Natural language processing',
        'transformers': 'Pre-trained language models',
    },
    'Model Interpretation': {
        'shap': 'SHAP values for explainability',
        'lime': 'Local interpretable explanations',
    }
}

for category, libs in libraries.items():
    print(f"\n{category}:")
    for lib, desc in libs.items():
        print(f"  {lib}: {desc}")
# Check your environment

def check_library(name: str) -> tuple:
    """Check if a library is installed and get its version."""
    try:
        module = __import__(name.replace('-', '_'))
        version = getattr(module, '__version__', 'unknown')
        return True, version
    except ImportError:
        return False, None

print("Environment Check")
print("="*50)

essential_libs = ['pandas', 'numpy', 'sklearn', 'matplotlib', 'yfinance']
optional_libs = ['xgboost', 'lightgbm', 'torch', 'shap']

print("\nEssential Libraries:")
for lib in essential_libs:
    installed, version = check_library(lib)
    status = f"v{version}" if installed else "NOT INSTALLED"
    symbol = "OK" if installed else "MISSING"
    print(f"  [{symbol}] {lib}: {status}")

print("\nOptional Libraries:")
for lib in optional_libs:
    installed, version = check_library(lib)
    status = f"v{version}" if installed else "not installed"
    symbol = "OK" if installed else "--"
    print(f"  [{symbol}] {lib}: {status}")
# Jupyter Notebook Best Practices for ML

print("Best Practices for ML in Jupyter")
print("="*50)

best_practices = [
    ("Set random seeds", "np.random.seed(42) for reproducibility"),
    ("Suppress warnings carefully", "warnings.filterwarnings('ignore') when appropriate"),
    ("Use autoreload", "%load_ext autoreload for module development"),
    ("Track experiments", "Log hyperparameters and results systematically"),
    ("Checkpoint models", "Save model state periodically"),
    ("Memory management", "Delete large objects with del, use gc.collect()"),
    ("Version control", "Strip outputs before committing notebooks")
]

for practice, explanation in best_practices:
    print(f"\n{practice}:")
    print(f"  {explanation}")

Exercise 1.3: Environment Validator (Guided)

Create a function that validates the ML environment is ready.

Solution 1.3
from typing import Dict, List

def validate_ml_environment(required_libs: List[str], optional_libs: List[str] = None) -> Dict:
    """
    Validate that required ML libraries are installed.

    Args:
        required_libs: List of required library names
        optional_libs: List of optional library names

    Returns:
        Dictionary with validation results
    """
    if optional_libs is None:
        optional_libs = []

    results = {
        'required': {},
        'optional': {},
        'ready': True,
        'missing_required': []
    }

    # Check required libraries
    for lib in required_libs:
        try:
            module = __import__(lib.replace('-', '_'))
            version = getattr(module, '__version__', 'installed')
            results['required'][lib] = {'installed': True, 'version': version}
        except ImportError:
            results['required'][lib] = {'installed': False, 'version': None}
            results['ready'] = False
            results['missing_required'].append(lib)

    # Check optional libraries
    for lib in optional_libs:
        try:
            module = __import__(lib.replace('-', '_'))
            version = getattr(module, '__version__', 'installed')
            results['optional'][lib] = {'installed': True, 'version': version}
        except ImportError:
            results['optional'][lib] = {'installed': False, 'version': None}

    return results

Open-Ended Exercises

Exercise 1.4: Financial ML Challenges Analysis (Open-ended)

Create a comprehensive analysis of why a specific ML model might fail in financial markets.

Solution 1.4
def analyze_ml_challenges(df: pd.DataFrame, return_col: str = 'returns') -> dict:
    """
    Analyze a dataset for potential ML challenges.

    Args:
        df: DataFrame with financial data
        return_col: Name of the returns column

    Returns:
        Dictionary with challenge analysis
    """
    report = {
        'signal_to_noise': {},
        'regime_stability': {},
        'potential_lookahead': [],
        'recommendations': []
    }

    # Signal-to-noise analysis
    if return_col in df.columns:
        returns = df[return_col].dropna()
        mean_return = returns.mean()
        std_return = returns.std()
        snr = abs(mean_return) / std_return if std_return > 0 else 0

        report['signal_to_noise'] = {
            'mean_return': mean_return,
            'std_return': std_return,
            'ratio': snr,
            'assessment': 'Low' if snr < 0.1 else 'Moderate' if snr < 0.2 else 'Good'
        }

        if snr < 0.1:
            report['recommendations'].append(
                "Very low signal-to-noise. Consider strong regularization."
            )

    # Regime stability
    mid_point = len(df) // 2
    first_half = df[return_col][:mid_point].dropna() if return_col in df.columns else None
    second_half = df[return_col][mid_point:].dropna() if return_col in df.columns else None

    if first_half is not None and len(first_half) > 0:
        vol_ratio = second_half.std() / first_half.std()
        report['regime_stability'] = {
            'first_half_vol': first_half.std(),
            'second_half_vol': second_half.std(),
            'volatility_ratio': vol_ratio,
            'stable': 0.5 < vol_ratio < 2.0
        }

        if not report['regime_stability']['stable']:
            report['recommendations'].append(
                "Significant regime change detected. Consider regime-aware models."
            )

    # Lookahead bias check
    suspicious_keywords = ['future', 'next', 'forward', 'target', 'label']
    for col in df.columns:
        if any(kw in col.lower() for kw in suspicious_keywords):
            report['potential_lookahead'].append(col)

    if report['potential_lookahead']:
        report['recommendations'].append(
            f"Columns with potential lookahead: {report['potential_lookahead']}. Verify timing."
        )

    return report

# Test
import yfinance as yf
df = yf.Ticker("SPY").history(period="2y")
df['returns'] = df['Close'].pct_change()
df['future_return'] = df['returns'].shift(-1)  # Intentional lookahead

analysis = analyze_ml_challenges(df, 'returns')

print("ML Challenges Analysis")
print("="*50)
print(f"\nSignal-to-Noise: {analysis['signal_to_noise']['assessment']}")
print(f"  Ratio: {analysis['signal_to_noise']['ratio']:.4f}")
print(f"\nRegime Stability: {'Stable' if analysis['regime_stability'].get('stable') else 'Unstable'}")
vol_ratio = analysis['regime_stability'].get('volatility_ratio')
print(f"  Vol Ratio: {vol_ratio:.2f}" if vol_ratio is not None else "  Vol Ratio: N/A")
print(f"\nPotential Lookahead Issues: {analysis['potential_lookahead']}")
print(f"\nRecommendations:")
for rec in analysis['recommendations']:
    print(f"  - {rec}")

Exercise 1.5: ML Pipeline Builder (Open-ended)

Create a class that represents and validates an ML pipeline configuration.

Solution 1.5
class MLPipelineConfig:
    """
    Configuration manager for ML pipelines.
    """

    REQUIRED_STAGES = ['data', 'features', 'target', 'model', 'evaluation']

    def __init__(self):
        self.config = {}
        self.validation_errors = []

    def add_stage(self, stage_name: str, config: dict) -> 'MLPipelineConfig':
        """
        Add a stage configuration.

        Args:
            stage_name: Name of the pipeline stage
            config: Configuration dictionary for the stage

        Returns:
            Self for method chaining
        """
        if 'method' not in config:
            self.validation_errors.append(
                f"Stage '{stage_name}' missing required 'method' key"
            )
        self.config[stage_name] = config
        return self

    def validate(self) -> bool:
        """
        Validate the pipeline configuration.

        Returns:
            True if valid, False otherwise
        """
        self.validation_errors = []

        # Check required stages
        for stage in self.REQUIRED_STAGES:
            if stage not in self.config:
                self.validation_errors.append(f"Missing required stage: {stage}")

        # Check for common errors
        if 'data' in self.config:
            data_config = self.config['data']
            if data_config.get('test_size', 0) > 0.5:
                self.validation_errors.append(
                    "Warning: Test size > 50% may leave insufficient training data"
                )

        if 'model' in self.config:
            model_config = self.config['model']
            if model_config.get('cv_method') == 'random':
                self.validation_errors.append(
                    "Warning: Random CV is inappropriate for time series. Use TimeSeriesSplit."
                )

        return len(self.validation_errors) == 0

    def get_summary(self) -> str:
        """
        Generate a summary report of the pipeline.

        Returns:
            Summary string
        """
        lines = ["ML Pipeline Configuration Summary", "=" * 40]

        for stage, config in self.config.items():
            lines.append(f"\n{stage.upper()}:")
            for key, value in config.items():
                lines.append(f"  {key}: {value}")

        is_valid = self.validate()
        lines.append(f"\nValidation: {'PASSED' if is_valid else 'FAILED'}")

        if self.validation_errors:
            lines.append("\nIssues:")
            for error in self.validation_errors:
                lines.append(f"  - {error}")

        return "\n".join(lines)

# Test the pipeline builder
pipeline = MLPipelineConfig()
pipeline.add_stage('data', {
    'method': 'yfinance',
    'symbols': ['SPY', 'AAPL'],
    'period': '2y',
    'test_size': 0.2
})
pipeline.add_stage('features', {
    'method': 'technical_indicators',
    'indicators': ['sma', 'rsi', 'macd']
})
pipeline.add_stage('target', {
    'method': 'direction',
    'horizon': 1
})
pipeline.add_stage('model', {
    'method': 'random_forest',
    'cv_method': 'time_series',
    'n_splits': 5
})
pipeline.add_stage('evaluation', {
    'method': 'classification_metrics',
    'metrics': ['accuracy', 'f1', 'sharpe']
})

print(pipeline.get_summary())

Exercise 1.6: Complete ML Workflow Skeleton (Open-ended)

Create a complete but minimal ML workflow for trading that demonstrates all pipeline stages.

Solution 1.6
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
import yfinance as yf

class TradingMLWorkflow:
    """
    Complete ML workflow for trading applications.
    """

    def __init__(self, symbol: str, period: str = '2y'):
        self.symbol = symbol
        self.period = period
        self.data = None
        self.model = None
        self.results = None

    def fetch_data(self) -> pd.DataFrame:
        """Stage 1: Data Collection"""
        ticker = yf.Ticker(self.symbol)
        self.data = ticker.history(period=self.period)
        return self.data

    def engineer_features(self) -> pd.DataFrame:
        """Stage 2 & 3: Data Preparation & Feature Engineering"""
        df = self.data.copy()

        # Basic features
        df['returns'] = df['Close'].pct_change()
        df['volatility'] = df['returns'].rolling(20).std()
        df['momentum_5'] = df['Close'].pct_change(5)
        df['momentum_20'] = df['Close'].pct_change(20)
        df['volume_change'] = df['Volume'].pct_change()

        # Distance from moving averages
        df['sma_20'] = df['Close'].rolling(20).mean()
        df['dist_sma'] = (df['Close'] - df['sma_20']) / df['sma_20']

        self.data = df
        return df

    def create_target(self, horizon: int = 1) -> pd.DataFrame:
        """Stage 4: Target Engineering"""
        df = self.data.copy()
        df['target'] = (df['returns'].shift(-horizon) > 0).astype(int)
        self.data = df
        return df

    def train_model(self, test_size: float = 0.2) -> dict:
        """Stage 5 & 6: Model Training & Evaluation"""
        # Prepare data
        feature_cols = ['returns', 'volatility', 'momentum_5', 'momentum_20', 
                       'volume_change', 'dist_sma']
        df_clean = self.data.dropna()

        X = df_clean[feature_cols]
        y = df_clean['target']

        # Time series split
        split_idx = int(len(X) * (1 - test_size))
        X_train, X_test = X[:split_idx], X[split_idx:]
        y_train, y_test = y[:split_idx], y[split_idx:]

        # Train
        self.model = RandomForestClassifier(
            n_estimators=100, 
            max_depth=5, 
            random_state=42
        )
        self.model.fit(X_train, y_train)

        # Predict
        y_pred = self.model.predict(X_test)
        y_prob = self.model.predict_proba(X_test)[:, 1]

        # Calculate metrics
        self.results = {
            'accuracy': accuracy_score(y_test, y_pred),
            'precision': precision_score(y_test, y_pred),
            'recall': recall_score(y_test, y_pred),
            'train_size': len(X_train),
            'test_size': len(X_test),
            'feature_importance': dict(zip(feature_cols, self.model.feature_importances_)),
            'predictions': pd.DataFrame({
                'actual': y_test.values,
                'predicted': y_pred,
                'probability': y_prob
            }, index=y_test.index)
        }

        return self.results

    def run_full_pipeline(self) -> dict:
        """Execute the complete pipeline."""
        print(f"Running ML Pipeline for {self.symbol}")
        print("=" * 50)

        print("\n1. Fetching data...")
        self.fetch_data()
        print(f"   Downloaded {len(self.data)} rows")

        print("\n2. Engineering features...")
        self.engineer_features()
        print("   Created 6 model features")

        print("\n3. Creating target...")
        self.create_target()
        print(f"   Target: next-day direction")

        print("\n4. Training and evaluating...")
        results = self.train_model()

        print(f"\n" + "=" * 50)
        print("RESULTS:")
        print(f"  Accuracy: {results['accuracy']:.2%}")
        print(f"  Precision: {results['precision']:.2%}")
        print(f"  Recall: {results['recall']:.2%}")
        print(f"\nTop Features:")
        sorted_features = sorted(
            results['feature_importance'].items(), 
            key=lambda x: -x[1]
        )
        for feat, imp in sorted_features[:3]:
            print(f"  {feat}: {imp:.3f}")

        return results

# Run the workflow
workflow = TradingMLWorkflow('SPY', '2y')
results = workflow.run_full_pipeline()

Module Project: ML Project Template

Create a reusable project template that sets up the structure for any financial ML project.

# Module Project: ML Project Template

import os
from datetime import datetime
from typing import Dict, List, Optional
import json


class MLProjectTemplate:
    """
    Creates a standardized project structure for financial ML projects.
    
    This template ensures consistent organization and includes
    all necessary components for reproducible ML experiments.
    """
    
    def __init__(self, project_name: str, description: str = ""):
        """
        Initialize a new ML project template.
        
        Args:
            project_name: Name of the project
            description: Brief description of the project
        """
        self.project_name = project_name
        self.description = description
        self.created_at = datetime.now().isoformat()
        self.config = self._default_config()
        self.directory_structure = self._default_structure()
    
    def _default_config(self) -> Dict:
        """Return default project configuration."""
        return {
            'data': {
                'source': 'yfinance',
                'symbols': [],
                'period': '2y',
                'test_size': 0.2,
                'validation_size': 0.1
            },
            'features': {
                'price_based': ['returns', 'volatility', 'momentum'],
                'technical': ['sma', 'rsi', 'macd'],
                'scaling': 'standard'
            },
            'target': {
                'type': 'classification',
                'method': 'direction',
                'horizon': 1
            },
            'model': {
                'algorithm': 'random_forest',
                'cv_method': 'time_series',
                'n_splits': 5,
                'hyperparameters': {}
            },
            'evaluation': {
                'ml_metrics': ['accuracy', 'precision', 'recall', 'f1'],
                'financial_metrics': ['sharpe', 'returns', 'max_drawdown']
            },
            'random_seed': 42
        }
    
    def _default_structure(self) -> Dict:
        """Return default directory structure."""
        return {
            'data': {
                'raw': 'Original data files',
                'processed': 'Cleaned and transformed data',
                'features': 'Engineered features'
            },
            'notebooks': {
                '01_data_exploration.ipynb': 'EDA notebook',
                '02_feature_engineering.ipynb': 'Feature creation',
                '03_model_training.ipynb': 'Model development',
                '04_evaluation.ipynb': 'Results analysis'
            },
            'src': {
                'data': 'Data loading and processing modules',
                'features': 'Feature engineering code',
                'models': 'Model definitions',
                'evaluation': 'Evaluation utilities'
            },
            'models': 'Saved model files',
            'reports': 'Generated reports and visualizations',
            'config': 'Configuration files'
        }
    
    def set_symbols(self, symbols: List[str]) -> 'MLProjectTemplate':
        """Set the trading symbols for the project."""
        self.config['data']['symbols'] = symbols
        return self
    
    def set_model(self, algorithm: str, **kwargs) -> 'MLProjectTemplate':
        """Set the model algorithm and hyperparameters."""
        self.config['model']['algorithm'] = algorithm
        self.config['model']['hyperparameters'] = kwargs
        return self
    
    def set_target(self, target_type: str, method: str, horizon: int = 1) -> 'MLProjectTemplate':
        """Set the prediction target configuration."""
        self.config['target'] = {
            'type': target_type,
            'method': method,
            'horizon': horizon
        }
        return self
    
    def validate_config(self) -> Dict:
        """
        Validate the project configuration.
        
        Returns:
            Dictionary with validation results and warnings
        """
        results = {
            'valid': True,
            'errors': [],
            'warnings': []
        }
        
        # Check required fields
        if not self.config['data']['symbols']:
            results['warnings'].append("No symbols specified")
        
        # Check for common mistakes
        test_size = self.config['data']['test_size']
        val_size = self.config['data']['validation_size']
        if test_size + val_size > 0.5:
            results['warnings'].append(
                f"Test + validation = {test_size + val_size:.0%}, leaving only "
                f"{1 - test_size - val_size:.0%} for training"
            )
        
        # Check model configuration
        if self.config['model']['cv_method'] == 'random':
            results['errors'].append(
                "Random CV is inappropriate for time series data"
            )
            results['valid'] = False
        
        return results
    
    def generate_readme(self) -> str:
        """Generate a README for the project."""
        readme = f"""
# {self.project_name}

{self.description}

Created: {self.created_at}

## Configuration

### Data
- Source: {self.config['data']['source']}
- Symbols: {', '.join(self.config['data']['symbols']) or 'Not specified'}
- Period: {self.config['data']['period']}

### Model
- Algorithm: {self.config['model']['algorithm']}
- CV Method: {self.config['model']['cv_method']}

### Target
- Type: {self.config['target']['type']}
- Method: {self.config['target']['method']}
- Horizon: {self.config['target']['horizon']} day(s)

## Directory Structure

```
{self.project_name}/
├── data/
│   ├── raw/
│   ├── processed/
│   └── features/
├── notebooks/
├── src/
│   ├── data/
│   ├── features/
│   ├── models/
│   └── evaluation/
├── models/
├── reports/
└── config/
```

## Usage

1. Start with `notebooks/01_data_exploration.ipynb`
2. Engineer features in `notebooks/02_feature_engineering.ipynb`
3. Train models in `notebooks/03_model_training.ipynb`
4. Evaluate results in `notebooks/04_evaluation.ipynb`
"""
        return readme.strip()
    
    def get_summary(self) -> str:
        """Get a summary of the project template."""
        validation = self.validate_config()
        
        summary = f"""
{'='*60}
ML PROJECT TEMPLATE: {self.project_name}
{'='*60}

Description: {self.description or 'Not provided'}
Created: {self.created_at}

CONFIGURATION:
  Data Source: {self.config['data']['source']}
  Symbols: {self.config['data']['symbols'] or 'Not set'}
  Model: {self.config['model']['algorithm']}
  Target: {self.config['target']['method']} ({self.config['target']['type']})

VALIDATION: {'PASSED' if validation['valid'] else 'FAILED'}
"""
        
        if validation['errors']:
            summary += "\nERRORS:\n"
            for error in validation['errors']:
                summary += f"  - {error}\n"
        
        if validation['warnings']:
            summary += "\nWARNINGS:\n"
            for warning in validation['warnings']:
                summary += f"  - {warning}\n"
        
        return summary


# Demo the project template
print("Creating ML Project Template...\n")

project = MLProjectTemplate(
    project_name="SPY Direction Predictor",
    description="Predict next-day direction of SPY using technical features"
)

# Configure the project
project.set_symbols(['SPY'])
project.set_model(
    'random_forest',
    n_estimators=100,
    max_depth=5
)
project.set_target('classification', 'direction', horizon=1)

# Display summary
print(project.get_summary())

# Show README preview
print("\n" + "="*60)
print("README PREVIEW:")
print("="*60)
print(project.generate_readme()[:1000] + "...")

Key Takeaways

  1. ML Types: Supervised learning (classification/regression) predicts from labeled examples; unsupervised learning finds patterns without labels

  2. Finance is Different: Low signal-to-noise ratio, non-stationarity, regime changes, and an adversarial environment make financial ML uniquely challenging

  3. The Pipeline: Data → Features → Target → Model → Evaluation → Deployment; each stage requires finance-specific considerations

  4. Lookahead Bias: The most common and costly mistake; always verify your features don't include future information

  5. Time Series Split: Never randomly shuffle financial data; always maintain temporal order in train/test splits

  6. Tool Ecosystem: scikit-learn, xgboost, and pandas form the core; add specialized tools as needed
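
The lookahead warning in takeaway 4 is easy to demonstrate. A minimal sketch on synthetic data (hypothetical series, not from the course material): a feature built with a negative shift contains tomorrow's return and predicts the target perfectly, which is exactly the implausible result that should trigger suspicion.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns (fixed seed for reproducibility)
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0, 0.01, 500))

# A "feature" that accidentally contains tomorrow's return: classic lookahead bias
leaky_feature = returns.shift(-1)
target = (returns.shift(-1) > 0).astype(int)

# The leak shows up as an implausibly perfect relationship with the target
valid = leaky_feature.notna()
accuracy = ((leaky_feature[valid] > 0).astype(int) == target[valid]).mean()
print(f"'Accuracy' from the leaky feature: {accuracy:.0%}")  # 100%
```

In practice the leak is rarely this blatant, but any accuracy far above what takeaway 2's signal-to-noise discussion suggests is plausible deserves the same timing audit.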


Next: Module 2 - Data Preparation

Learn how to properly prepare financial data for ML, including handling missing values, train-test splits for time series, and avoiding common data leakage pitfalls.

Module 2: Data Preparation

Part 1: ML Fundamentals for Finance

Duration Exercises
~2.5 hours 6

Learning Objectives

By the end of this module, you will be able to:

  • Clean financial data while preserving signal integrity
  • Implement proper train-test splits for time series
  • Use time series cross-validation techniques
  • Handle imbalanced class distributions in trading data

2.1 Financial Data Cleaning

Financial data has unique cleaning requirements that differ from general data science.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import yfinance as yf
import warnings
warnings.filterwarnings('ignore')

# Download sample data
print("Downloading sample data...")
df = yf.Ticker("AAPL").history(period="2y")
print(f"Downloaded {len(df)} rows")
print(f"\nColumns: {list(df.columns)}")
print(f"\nDate range: {df.index[0].date()} to {df.index[-1].date()}")
# Check for missing data

def analyze_missing_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Analyze missing data patterns in a DataFrame.
    
    Args:
        df: Input DataFrame
        
    Returns:
        DataFrame with missing data statistics
    """
    missing_stats = pd.DataFrame({
        'missing_count': df.isnull().sum(),
        'missing_pct': df.isnull().sum() / len(df) * 100,
        'dtype': df.dtypes
    })
    return missing_stats.sort_values('missing_pct', ascending=False)

print("Missing Data Analysis:")
print(analyze_missing_data(df))
# Handling missing values in financial data

class FinancialDataCleaner:
    """
    Clean financial time series data with appropriate methods.
    """
    
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        self.cleaning_log = []
    
    def handle_missing_prices(self, method: str = 'ffill') -> 'FinancialDataCleaner':
        """
        Handle missing price data.
        
        For OHLC data, forward fill is usually appropriate as it
        represents "last known price" - what you'd actually trade at.
        """
        price_cols = ['Open', 'High', 'Low', 'Close']
        price_cols = [c for c in price_cols if c in self.df.columns]
        
        before_missing = self.df[price_cols].isnull().sum().sum()
        
        if method == 'ffill':
            self.df[price_cols] = self.df[price_cols].ffill()
        elif method == 'interpolate':
            self.df[price_cols] = self.df[price_cols].interpolate(method='time')
        
        after_missing = self.df[price_cols].isnull().sum().sum()
        self.cleaning_log.append(
            f"Filled {before_missing - after_missing} missing price values using {method}"
        )
        
        return self
    
    def handle_missing_volume(self, fill_value: float = 0) -> 'FinancialDataCleaner':
        """
        Handle missing volume data.
        
        Missing volume often means no trades occurred, so filling with 0 is reasonable.
        """
        if 'Volume' in self.df.columns:
            before_missing = self.df['Volume'].isnull().sum()
            self.df['Volume'] = self.df['Volume'].fillna(fill_value)
            self.cleaning_log.append(
                f"Filled {before_missing} missing volume values with {fill_value}"
            )
        return self
    
    def detect_outliers(self, column: str, n_std: float = 5) -> pd.Series:
        """
        Detect outliers using standard deviation method.
        
        For financial data, 5 std is a reasonable threshold as
        returns can legitimately be extreme.
        """
        if column not in self.df.columns:
            return pd.Series(dtype=bool)
        
        mean = self.df[column].mean()
        std = self.df[column].std()
        
        lower_bound = mean - n_std * std
        upper_bound = mean + n_std * std
        
        is_outlier = (self.df[column] < lower_bound) | (self.df[column] > upper_bound)
        
        self.cleaning_log.append(
            f"Found {is_outlier.sum()} outliers in {column} (>{n_std} std)"
        )
        
        return is_outlier
    
    def remove_zero_volume_days(self) -> 'FinancialDataCleaner':
        """
        Remove days with zero volume (likely data errors or holidays).
        """
        if 'Volume' in self.df.columns:
            before_len = len(self.df)
            self.df = self.df[self.df['Volume'] > 0]
            removed = before_len - len(self.df)
            self.cleaning_log.append(f"Removed {removed} zero-volume days")
        return self
    
    def get_cleaned_data(self) -> pd.DataFrame:
        """Return the cleaned DataFrame."""
        return self.df
    
    def get_cleaning_report(self) -> str:
        """Return a report of all cleaning operations."""
        report = "Data Cleaning Report\n" + "="*40 + "\n"
        for entry in self.cleaning_log:
            report += f"- {entry}\n"
        return report


# Apply cleaning
cleaner = FinancialDataCleaner(df)
cleaner.handle_missing_prices('ffill')
cleaner.handle_missing_volume(0)
cleaner.remove_zero_volume_days()

# Add returns and check for outliers
df_clean = cleaner.get_cleaned_data()
df_clean['returns'] = df_clean['Close'].pct_change()
cleaner.df = df_clean  # push the updated frame back so detect_outliers sees 'returns'
outliers = cleaner.detect_outliers('returns', n_std=5)

print(cleaner.get_cleaning_report())
print(f"\nFinal dataset: {len(df_clean)} rows")

Exercise 2.1: Data Quality Checker (Guided)

Build a function to assess data quality for ML readiness.

Solution 2.1
def check_data_quality(df: pd.DataFrame, price_col: str = 'Close') -> dict:
    """
    Check data quality for ML readiness.
    """
    quality = {
        'row_count': len(df),
        'date_range': None,
        'missing_data': {},
        'issues': [],
        'ready_for_ml': True
    }

    # Calculate date range
    if hasattr(df.index, 'min') and hasattr(df.index, 'max'):
        quality['date_range'] = {
            'start': str(df.index.min()),
            'end': str(df.index.max())
        }

    # Calculate missing data percentage for each column
    for col in df.columns:
        missing_pct = df[col].isnull().sum() / len(df) * 100
        quality['missing_data'][col] = round(missing_pct, 2)

        if missing_pct > 5:
            quality['issues'].append(f"{col} has {missing_pct:.1f}% missing data")
            quality['ready_for_ml'] = False

    # Check for minimum data requirements
    if len(df) < 252:
        quality['issues'].append("Less than 252 rows (1 year of trading days)")
        quality['ready_for_ml'] = False

    return quality

2.2 Train-Test Split for Time Series

Critical: Never randomly shuffle time series data. Always maintain temporal order.

# Why random split fails for time series

print("Random Split vs Time Series Split")
print("="*50)

print("""
RANDOM SPLIT (WRONG for time series):
┌─────────────────────────────────────────┐
│ Train Train Test Train Test Train Test  │
│   ↑      ↑     ↑     ↑     ↑     ↑   ↑  │
│   Randomly scattered across time        │
└─────────────────────────────────────────┘
Problem: Model can "peek" at future data during training!

TIME SERIES SPLIT (CORRECT):
┌─────────────────────────────────────────┐
│ Train Train Train Train │ Test Test Test│
│ ←─── Earlier ──→        │ ←─ Later ──→  │
└─────────────────────────────────────────┘
Correct: Model only trained on past, tested on future.
""")
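The inflation caused by shuffling can be measured directly. Below is a sketch on synthetic autocorrelated data (exact scores vary with the seed and model); the shuffled estimate is typically optimistic because training folds contain points immediately adjacent in time to the test points.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

# Synthetic autocorrelated series (smoothed noise): neighbors in time look alike
rng = np.random.default_rng(42)
series = np.convolve(rng.normal(0, 1, 1000), np.ones(10) / 10, mode='same')

X = series[:-1].reshape(-1, 1)               # feature: today's level
y = (series[1:] > series[:-1]).astype(int)   # target: does tomorrow rise?

model = RandomForestClassifier(n_estimators=50, random_state=42)

# Shuffled KFold lets the model train on points adjacent to the test points
shuffled = cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=42)).mean()
temporal = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5)).mean()

print(f"Shuffled KFold accuracy:  {shuffled:.3f}")
print(f"TimeSeriesSplit accuracy: {temporal:.3f}")
```

The gap between the two estimates is the leakage you would have mistaken for skill.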
# Proper time series split implementation

from typing import Tuple

def time_series_split(
    df: pd.DataFrame,
    test_size: float = 0.2,
    validation_size: float = 0.1
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Split time series data maintaining temporal order.
    
    Args:
        df: DataFrame sorted by date
        test_size: Proportion for test set (most recent data)
        validation_size: Proportion for validation set
        
    Returns:
        Tuple of (train, validation, test) DataFrames
    """
    n = len(df)
    
    # Calculate split indices
    test_start = int(n * (1 - test_size))
    val_start = int(n * (1 - test_size - validation_size))
    
    # Split
    train = df.iloc[:val_start]
    validation = df.iloc[val_start:test_start]
    test = df.iloc[test_start:]
    
    return train, validation, test


# Apply split
train, val, test = time_series_split(df_clean, test_size=0.2, validation_size=0.1)

print("Time Series Split Results:")
print("="*50)
print(f"\nTraining:   {len(train):4d} rows ({len(train)/len(df_clean)*100:.1f}%)")
print(f"  Period: {train.index[0].date()} to {train.index[-1].date()}")
print(f"\nValidation: {len(val):4d} rows ({len(val)/len(df_clean)*100:.1f}%)")
print(f"  Period: {val.index[0].date()} to {val.index[-1].date()}")
print(f"\nTest:       {len(test):4d} rows ({len(test)/len(df_clean)*100:.1f}%)")
print(f"  Period: {test.index[0].date()} to {test.index[-1].date()}")
# Visualize the split

fig, ax = plt.subplots(figsize=(14, 5))

ax.plot(train.index, train['Close'], 'b-', label='Training', linewidth=1)
ax.plot(val.index, val['Close'], 'orange', label='Validation', linewidth=1)
ax.plot(test.index, test['Close'], 'g-', label='Test', linewidth=1)

# Add vertical lines at split points
ax.axvline(val.index[0], color='gray', linestyle='--', alpha=0.7)
ax.axvline(test.index[0], color='gray', linestyle='--', alpha=0.7)

ax.set_title('Time Series Train/Validation/Test Split')
ax.set_xlabel('Date')
ax.set_ylabel('Price')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Purging and Embargo

When features or labels overlap in time, we need additional safeguards.

# Purging and Embargo explained

print("Purging and Embargo")
print("="*50)

print("""
PROBLEM: Labels often span multiple days (e.g., 5-day returns)

Without purging:
┌────────────────────────────────────────────────────┐
│ Day 1   Day 2   Day 3 │ Day 4   Day 5   Day 6     │
│ ←─── Training ────→   │ ←──── Test ────→          │
│        └───────────────┘                          │
│        Label for Day 3 includes Day 4 & 5!        │
└────────────────────────────────────────────────────┘
LEAKAGE: Training label contains test period info

With purging (remove overlap):
┌────────────────────────────────────────────────────┐
│ Day 1   Day 2 │ PURGED │ Day 4   Day 5   Day 6    │
│ ←─ Training ─→│        │ ←──── Test ────→         │
└────────────────────────────────────────────────────┘
SAFE: Gap prevents information leakage

Embargo adds extra buffer after test start:
┌────────────────────────────────────────────────────┐
│ Day 1 │ PURGED │ EMBARGO │ Day 5   Day 6   Day 7  │
│ Train │        │         │ ←──── Test ────→       │
└────────────────────────────────────────────────────┘
""")

def apply_purge_embargo(
    train_idx: pd.DatetimeIndex,
    test_idx: pd.DatetimeIndex,
    label_horizon: int = 5,
    embargo_days: int = 1
) -> pd.DatetimeIndex:
    """
    Remove training samples that overlap with test period.
    
    Args:
        train_idx: Training data index
        test_idx: Test data index
        label_horizon: How many days the label spans
        embargo_days: Additional buffer days
        
    Returns:
        Purged training index
    """
    test_start = test_idx.min()
    
    # Purge: Remove samples whose labels would overlap with test
    purge_cutoff = test_start - pd.Timedelta(days=label_horizon)
    
    # Embargo: Additional buffer
    embargo_cutoff = purge_cutoff - pd.Timedelta(days=embargo_days)
    
    purged_idx = train_idx[train_idx < embargo_cutoff]
    
    return purged_idx

# Example
purged_train_idx = apply_purge_embargo(
    train.index, 
    test.index, 
    label_horizon=5,
    embargo_days=1
)

print(f"\nOriginal training samples: {len(train)}")
print(f"After purge + embargo:     {len(purged_train_idx)}")
print(f"Removed:                   {len(train) - len(purged_train_idx)} samples")
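
The function above purges by calendar Timedelta, which is simple but overlaps weekends and holidays. When labels are defined over trading days, a positional variant is cleaner. A minimal sketch (the helper name `purge_by_position` is ours, not part of any library):

```python
import numpy as np

def purge_by_position(n_train: int, label_horizon: int = 5, embargo: int = 1) -> np.ndarray:
    """Positional purge: drop the last (label_horizon + embargo) training rows
    so no training label overlaps the test period, counting trading days."""
    keep = max(n_train - label_horizon - embargo, 0)
    return np.arange(keep)

print(len(purge_by_position(500, label_horizon=5, embargo=1)))  # 494
```

Because it counts rows rather than dates, the same purge width applies regardless of how many non-trading days fall in the gap.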

Exercise 2.2: Time Series Splitter (Guided)

Build a comprehensive time series splitter class.

Solution 2.2
class TimeSeriesSplitter:
    """
    Handles time series data splitting with purging and embargo.
    """

    def __init__(self, test_size: float = 0.2, purge_days: int = 0, embargo_days: int = 0):
        self.test_size = test_size
        self.purge_days = purge_days
        self.embargo_days = embargo_days

    def split(self, df: pd.DataFrame) -> dict:
        """
        Split data into train and test sets.
        """
        n = len(df)

        # Calculate split index
        split_idx = int(n * (1 - self.test_size))

        # Initial split
        train = df.iloc[:split_idx].copy()
        test = df.iloc[split_idx:].copy()

        # Apply purge and embargo
        if self.purge_days > 0 or self.embargo_days > 0:
            total_gap = self.purge_days + self.embargo_days
            train = train.iloc[:-total_gap]

        return {
            'train': train,
            'test': test,
            'train_size': len(train),
            'test_size': len(test),
            'split_date': df.index[split_idx]
        }

2.3 Cross-Validation for Finance

Standard k-fold cross-validation doesn't work for time series.

# Time Series Cross-Validation with sklearn

from sklearn.model_selection import TimeSeriesSplit

# Create sample feature matrix
df_features = df_clean.copy()
df_features['returns'] = df_features['Close'].pct_change()
df_features['volatility'] = df_features['returns'].rolling(20).std()
df_features['momentum'] = df_features['Close'].pct_change(10)
df_features['target'] = (df_features['returns'].shift(-1) > 0).astype(int)
df_features = df_features.dropna()

X = df_features[['returns', 'volatility', 'momentum']]
y = df_features['target']

# Time series cross-validation
tscv = TimeSeriesSplit(n_splits=5)

print("Time Series Cross-Validation Folds:")
print("="*60)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
    train_start = X.index[train_idx[0]].date()
    train_end = X.index[train_idx[-1]].date()
    test_start = X.index[test_idx[0]].date()
    test_end = X.index[test_idx[-1]].date()
    
    print(f"\nFold {fold}:")
    print(f"  Train: {train_start} to {train_end} ({len(train_idx)} samples)")
    print(f"  Test:  {test_start} to {test_end} ({len(test_idx)} samples)")
# Visualize the CV folds

fig, axes = plt.subplots(5, 1, figsize=(14, 8), sharex=True)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    ax = axes[fold]
    
    # Plot training period
    train_dates = X.index[train_idx]
    test_dates = X.index[test_idx]
    
    ax.fill_between(train_dates, 0, 1, alpha=0.3, color='blue', label='Train')
    ax.fill_between(test_dates, 0, 1, alpha=0.3, color='green', label='Test')
    
    ax.set_ylabel(f'Fold {fold+1}')
    ax.set_yticks([])
    ax.set_xlim(X.index[0], X.index[-1])
    
    if fold == 0:
        ax.legend(loc='upper left')

axes[-1].set_xlabel('Date')
fig.suptitle('Time Series Cross-Validation Folds', fontsize=12)

plt.tight_layout()
plt.show()
# Run cross-validation with a model

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
tscv = TimeSeriesSplit(n_splits=5)

cv_results = []

print("Cross-Validation Results:")
print("="*50)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    cv_results.append({'fold': fold, 'accuracy': acc, 'f1': f1})
    print(f"Fold {fold}: Accuracy={acc:.3f}, F1={f1:.3f}")

# Summary
results_df = pd.DataFrame(cv_results)
print(f"\nMean Accuracy: {results_df['accuracy'].mean():.3f} (+/- {results_df['accuracy'].std():.3f})")
print(f"Mean F1:       {results_df['f1'].mean():.3f} (+/- {results_df['f1'].std():.3f})")

Exercise 2.3: Custom CV Generator (Guided)

Build a custom cross-validation generator with gap support.

Exercise
Solution 2.3
from typing import Generator, Tuple

def time_series_cv_with_gap(
    n_samples: int,
    n_splits: int = 5,
    gap: int = 0
) -> Generator[Tuple[np.ndarray, np.ndarray], None, None]:
    """
    Generate time series CV indices with a gap between train and test.
    """
    test_size = n_samples // (n_splits + 1)

    for i in range(n_splits):
        # Calculate train end index
        train_end = test_size * (i + 1)

        # Calculate test start and end (with gap)
        test_start = train_end + gap
        test_end = test_start + test_size

        # Ensure we don't exceed array bounds
        if test_end > n_samples:
            test_end = n_samples

        # Create index arrays
        train_indices = np.arange(0, train_end)
        test_indices = np.arange(test_start, test_end)

        yield train_indices, test_indices
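For reference: newer scikit-learn releases (0.24 and later) support this directly through `TimeSeriesSplit`'s `gap` parameter, so a custom generator like the one above is mainly useful when you need non-standard split logic. A minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Demonstrate the built-in gap: the training window always ends
# `gap` rows before the test window starts.
X_demo = np.arange(100).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3, gap=5)
for train_idx, test_idx in tscv.split(X_demo):
    # 5 excluded rows, plus the one-step offset between adjacent indices
    assert test_idx[0] - train_idx[-1] == 6
    print(f"train ends at {train_idx[-1]}, test starts at {test_idx[0]}")
```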

2.4 Handling Imbalanced Data

Trading labels are often imbalanced (e.g., more up days than down days).

# Check class balance

print("Class Balance Analysis:")
print("="*50)

class_counts = y.value_counts()
class_pcts = y.value_counts(normalize=True) * 100

print(f"\nClass Distribution:")
print(f"  Class 0 (Down): {class_counts[0]} ({class_pcts[0]:.1f}%)")
print(f"  Class 1 (Up):   {class_counts[1]} ({class_pcts[1]:.1f}%)")

imbalance_ratio = class_counts.max() / class_counts.min()
print(f"\nImbalance Ratio: {imbalance_ratio:.2f}:1")

if imbalance_ratio > 1.5:
    print("\nWarning: Dataset is imbalanced. Consider:")
    print("  - Class weights")
    print("  - Oversampling minority class")
    print("  - Undersampling majority class")
    print("  - Using appropriate metrics (F1, precision, recall)")
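Of the options printed above, oversampling the minority class is easy to sketch with `sklearn.utils.resample`. The labels below are synthetic, purely for illustration; with real market data, resample only within the training window so duplicated rows never cross the train/test boundary:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Synthetic imbalanced training set (~30% minority class)
rng = np.random.default_rng(42)
train = pd.DataFrame({
    'feature': rng.normal(size=100),
    'label': (rng.random(100) > 0.7).astype(int)
})

majority = train[train['label'] == 0]
minority = train[train['label'] == 1]

# Duplicate minority rows (with replacement) up to the majority count
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
train_balanced = pd.concat([majority, minority_up])

print("Before:", train['label'].value_counts().to_dict())
print("After: ", train_balanced['label'].value_counts().to_dict())
```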
# Techniques for handling imbalanced data

from sklearn.utils.class_weight import compute_class_weight

# Method 1: Class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
weight_dict = dict(zip(np.unique(y), class_weights))

print("Method 1: Class Weights")
print(f"  Computed weights: {weight_dict}")
print("  Usage: model.fit(X, y, sample_weight=weights)")

# Method 2: Sample weights based on class
sample_weights = np.array([weight_dict[label] for label in y])
print(f"\nMethod 2: Sample Weights")
print(f"  Shape: {sample_weights.shape}")

# Method 3: Using class_weight parameter in sklearn models
print(f"\nMethod 3: Built-in class_weight parameter")
print("  Usage: RandomForestClassifier(class_weight='balanced')")
# Compare balanced vs unbalanced training

from sklearn.metrics import classification_report

# Split data
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

# Model without class weights
model_unbalanced = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model_unbalanced.fit(X_train, y_train)
y_pred_unbalanced = model_unbalanced.predict(X_test)

# Model with class weights
model_balanced = RandomForestClassifier(
    n_estimators=100, 
    max_depth=5, 
    random_state=42,
    class_weight='balanced'
)
model_balanced.fit(X_train, y_train)
y_pred_balanced = model_balanced.predict(X_test)

print("Comparison: Unbalanced vs Balanced Training")
print("="*60)

print("\nUnbalanced Model:")
print(classification_report(y_test, y_pred_unbalanced, target_names=['Down', 'Up']))

print("\nBalanced Model (class_weight='balanced'):")
print(classification_report(y_test, y_pred_balanced, target_names=['Down', 'Up']))

Open-Ended Exercises

Exercise 2.4: Complete Data Pipeline (Open-ended)

Build a complete data preparation pipeline class.

Exercise
Solution 2.4
class DataPipeline:
    """
    Complete data preparation pipeline for financial ML.
    """

    def __init__(self, symbol: str, period: str = '2y'):
        self.symbol = symbol
        self.period = period
        self.raw_data = None
        self.clean_data = None
        self.train = None
        self.test = None
        self.quality_report = {}

    def fetch(self) -> 'DataPipeline':
        """Download data."""
        self.raw_data = yf.Ticker(self.symbol).history(period=self.period)
        self.quality_report['rows_downloaded'] = len(self.raw_data)
        return self

    def clean(self, fill_method: str = 'ffill') -> 'DataPipeline':
        """Clean missing values."""
        df = self.raw_data.copy()

        # Record missing before
        missing_before = df.isnull().sum().sum()

        # Fill prices using the requested method ('ffill' or 'bfill')
        price_cols = ['Open', 'High', 'Low', 'Close']
        df[price_cols] = getattr(df[price_cols], fill_method)()

        # Fill volume
        df['Volume'] = df['Volume'].fillna(0)

        # Remove remaining NaN rows
        df = df.dropna()

        self.clean_data = df
        self.quality_report['missing_filled'] = missing_before
        self.quality_report['rows_after_cleaning'] = len(df)

        return self

    def handle_outliers(self, column: str = 'returns', n_std: float = 5) -> 'DataPipeline':
        """Detect and optionally cap outliers."""
        df = self.clean_data.copy()

        # Create returns if not exists
        if column not in df.columns:
            df['returns'] = df['Close'].pct_change()

        mean = df[column].mean()
        std = df[column].std()
        lower = mean - n_std * std
        upper = mean + n_std * std

        outliers = (df[column] < lower) | (df[column] > upper)
        self.quality_report['outliers_detected'] = outliers.sum()

        # Cap outliers
        df[column] = df[column].clip(lower, upper)
        self.clean_data = df

        return self

    def split(self, test_size: float = 0.2) -> 'DataPipeline':
        """Time series train/test split."""
        n = len(self.clean_data)
        split_idx = int(n * (1 - test_size))

        self.train = self.clean_data.iloc[:split_idx]
        self.test = self.clean_data.iloc[split_idx:]

        self.quality_report['train_size'] = len(self.train)
        self.quality_report['test_size'] = len(self.test)
        self.quality_report['split_date'] = str(self.clean_data.index[split_idx].date())

        return self

    def get_report(self) -> dict:
        """Return quality report."""
        return self.quality_report

# Test
pipeline = DataPipeline('AAPL', '2y')
pipeline.fetch().clean().handle_outliers().split()

print("Pipeline Report:")
for key, value in pipeline.get_report().items():
    print(f"  {key}: {value}")

Exercise 2.5: Walk-Forward Validator (Open-ended)

Build a walk-forward validation system.

Exercise
Solution 2.5
class WalkForwardValidator:
    """
    Walk-forward validation for time series ML.
    """

    def __init__(self, initial_train_size: int, test_size: int, step_size: int = None):
        """
        Args:
            initial_train_size: Initial training window size
            test_size: Size of each test window
            step_size: How much to move forward (defaults to test_size)
        """
        self.initial_train_size = initial_train_size
        self.test_size = test_size
        self.step_size = step_size or test_size
        self.results = []

    def split(self, X: pd.DataFrame):
        """
        Generate walk-forward splits.

        Yields:
            Tuple of (train_idx, test_idx)
        """
        n = len(X)
        train_end = self.initial_train_size

        while train_end + self.test_size <= n:
            train_idx = np.arange(0, train_end)
            test_idx = np.arange(train_end, train_end + self.test_size)

            yield train_idx, test_idx

            train_end += self.step_size

    def validate(self, X, y, model, metric_func) -> dict:
        """
        Run walk-forward validation.

        Args:
            X: Feature DataFrame
            y: Target Series
            model: sklearn-compatible model
            metric_func: Function(y_true, y_pred) -> float

        Returns:
            Dictionary with validation results
        """
        self.results = []

        for fold, (train_idx, test_idx) in enumerate(self.split(X), 1):
            X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

            # Train and predict
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)

            # Calculate metric
            score = metric_func(y_test, y_pred)

            self.results.append({
                'fold': fold,
                'train_start': X.index[train_idx[0]],
                'train_end': X.index[train_idx[-1]],
                'test_start': X.index[test_idx[0]],
                'test_end': X.index[test_idx[-1]],
                'train_size': len(train_idx),
                'test_size': len(test_idx),
                'score': score
            })

        return {
            'folds': self.results,
            'mean_score': np.mean([r['score'] for r in self.results]),
            'std_score': np.std([r['score'] for r in self.results])
        }

# Test
wfv = WalkForwardValidator(initial_train_size=200, test_size=50, step_size=50)
model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)

results = wfv.validate(X, y, model, accuracy_score)

print("Walk-Forward Validation Results:")
print(f"Mean Score: {results['mean_score']:.3f} (+/- {results['std_score']:.3f})")
print(f"\nFolds: {len(results['folds'])}")
for fold in results['folds'][:3]:
    print(f"  Fold {fold['fold']}: {fold['score']:.3f} ({fold['test_start'].date()} to {fold['test_end'].date()})")

Exercise 2.6: Imbalanced Data Handler (Open-ended)

Create a comprehensive solution for handling imbalanced trading labels.

Exercise
Solution 2.6
class ImbalancedDataHandler:
    """
    Handle imbalanced classification data in trading.
    """

    def __init__(self, y: pd.Series):
        self.y = y
        self.class_counts = y.value_counts()
        self.class_pcts = y.value_counts(normalize=True)

    def analyze(self) -> dict:
        """Analyze class distribution."""
        return {
            'counts': self.class_counts.to_dict(),
            'percentages': (self.class_pcts * 100).round(2).to_dict(),
            'imbalance_ratio': self.class_counts.max() / self.class_counts.min(),
            'majority_class': self.class_counts.idxmax(),
            'minority_class': self.class_counts.idxmin()
        }

    def compute_class_weights(self, strategy: str = 'balanced') -> dict:
        """
        Compute class weights.

        Strategies:
            - 'balanced': sklearn's balanced method
            - 'inverse': simple inverse frequency
            - 'sqrt_inverse': square root of inverse frequency
        """
        classes = np.unique(self.y)

        if strategy == 'balanced':
            weights = compute_class_weight('balanced', classes=classes, y=self.y)
        elif strategy == 'inverse':
            weights = len(self.y) / (len(classes) * np.array([self.class_counts[c] for c in classes]))
        elif strategy == 'sqrt_inverse':
            weights = np.sqrt(len(self.y) / (len(classes) * np.array([self.class_counts[c] for c in classes])))
        else:
            weights = np.ones(len(classes))

        return dict(zip(classes, weights))

    def compute_sample_weights(self, class_weights: dict = None) -> np.ndarray:
        """Compute per-sample weights from class weights."""
        if class_weights is None:
            class_weights = self.compute_class_weights()

        return np.array([class_weights[label] for label in self.y])

    def get_report(self) -> str:
        """Generate analysis report."""
        analysis = self.analyze()
        weights = self.compute_class_weights()

        report = "Imbalanced Data Analysis\n" + "="*40 + "\n"
        report += f"\nClass Distribution:\n"
        for cls, count in analysis['counts'].items():
            pct = analysis['percentages'][cls]
            report += f"  Class {cls}: {count} ({pct}%)\n"

        report += f"\nImbalance Ratio: {analysis['imbalance_ratio']:.2f}:1\n"
        report += f"\nRecommended Class Weights:\n"
        for cls, weight in weights.items():
            report += f"  Class {cls}: {weight:.3f}\n"

        return report

# Test
handler = ImbalancedDataHandler(y)
print(handler.get_report())

# Get sample weights
sample_weights = handler.compute_sample_weights()
print(f"\nSample weights shape: {sample_weights.shape}")
print(f"Sample weights range: {sample_weights.min():.3f} - {sample_weights.max():.3f}")

Module Project: Data Preparation Pipeline

Build a complete, production-ready data preparation pipeline.

# Module Project: Complete Data Preparation Pipeline

import pandas as pd
import numpy as np
from typing import Dict, Tuple, Optional
from sklearn.model_selection import TimeSeriesSplit
from sklearn.utils.class_weight import compute_class_weight
import yfinance as yf


class MLDataPipeline:
    """
    Production-ready data preparation pipeline for financial ML.
    
    This pipeline handles the complete data preparation workflow:
    1. Data fetching and validation
    2. Cleaning and outlier handling
    3. Time series splitting with purge/embargo
    4. Class balance handling
    5. Quality reporting
    """
    
    def __init__(self, config: Dict = None):
        """
        Initialize pipeline with configuration.
        
        Args:
            config: Pipeline configuration dictionary
        """
        self.config = config or self._default_config()
        self.raw_data = None
        self.processed_data = None
        self.train_data = None
        self.test_data = None
        self.quality_metrics = {}
        self.processing_log = []
    
    def _default_config(self) -> Dict:
        """Default pipeline configuration."""
        return {
            'data': {
                'source': 'yfinance',
                'period': '2y'
            },
            'cleaning': {
                'fill_method': 'ffill',
                'outlier_std': 5.0
            },
            'splitting': {
                'test_size': 0.2,
                'purge_days': 5,
                'embargo_days': 1
            },
            'balance': {
                'strategy': 'balanced'
            }
        }
    
    def _log(self, message: str):
        """Add message to processing log."""
        self.processing_log.append(message)
    
    def fetch_data(self, symbol: str) -> 'MLDataPipeline':
        """
        Fetch data for a symbol.
        
        Args:
            symbol: Ticker symbol
            
        Returns:
            Self for chaining
        """
        period = self.config['data']['period']
        
        self._log(f"Fetching {symbol} data for {period}...")
        
        ticker = yf.Ticker(symbol)
        self.raw_data = ticker.history(period=period)
        
        self.quality_metrics['symbol'] = symbol
        self.quality_metrics['rows_fetched'] = len(self.raw_data)
        self.quality_metrics['date_range'] = (
            str(self.raw_data.index[0].date()),
            str(self.raw_data.index[-1].date())
        )
        
        self._log(f"Fetched {len(self.raw_data)} rows")
        
        return self
    
    def clean_data(self) -> 'MLDataPipeline':
        """
        Clean the data by handling missing values and outliers.
        
        Returns:
            Self for chaining
        """
        self._log("Cleaning data...")
        
        df = self.raw_data.copy()
        
        # Record initial missing
        missing_before = df.isnull().sum().sum()
        
        # Fill price data using the configured method ('ffill' or 'bfill')
        price_cols = ['Open', 'High', 'Low', 'Close']
        fill_method = self.config['cleaning']['fill_method']
        df[price_cols] = getattr(df[price_cols], fill_method)()
        
        # Fill volume
        df['Volume'] = df['Volume'].fillna(0)
        
        # Remove remaining NaN
        df = df.dropna()
        
        # Create returns
        df['returns'] = df['Close'].pct_change()
        
        # Handle outliers
        outlier_std = self.config['cleaning']['outlier_std']
        mean = df['returns'].mean()
        std = df['returns'].std()
        lower, upper = mean - outlier_std * std, mean + outlier_std * std
        outliers = (df['returns'] < lower) | (df['returns'] > upper)
        df['returns'] = df['returns'].clip(lower, upper)
        
        # Drop first row (NaN from pct_change)
        df = df.dropna()
        
        self.processed_data = df
        
        self.quality_metrics['missing_filled'] = missing_before
        self.quality_metrics['outliers_clipped'] = outliers.sum()
        self.quality_metrics['rows_after_cleaning'] = len(df)
        
        self._log(f"Filled {missing_before} missing values")
        self._log(f"Clipped {outliers.sum()} outliers")
        
        return self
    
    def split_data(self) -> 'MLDataPipeline':
        """
        Split data into train and test sets with purge/embargo.
        
        Returns:
            Self for chaining
        """
        self._log("Splitting data...")
        
        df = self.processed_data
        n = len(df)
        
        test_size = self.config['splitting']['test_size']
        purge_days = self.config['splitting']['purge_days']
        embargo_days = self.config['splitting']['embargo_days']
        
        # Calculate split point
        split_idx = int(n * (1 - test_size))
        
        # Apply purge and embargo to training set
        gap = purge_days + embargo_days
        train_end = split_idx - gap
        
        self.train_data = df.iloc[:train_end].copy()
        self.test_data = df.iloc[split_idx:].copy()
        
        self.quality_metrics['train_size'] = len(self.train_data)
        self.quality_metrics['test_size'] = len(self.test_data)
        self.quality_metrics['gap_size'] = gap
        self.quality_metrics['split_date'] = str(df.index[split_idx].date())
        
        self._log(f"Training set: {len(self.train_data)} rows")
        self._log(f"Test set: {len(self.test_data)} rows")
        self._log(f"Gap (purge + embargo): {gap} rows")
        
        return self
    
    def compute_class_weights(self, target_col: str = 'target') -> Dict:
        """
        Compute class weights for imbalanced data.
        
        Args:
            target_col: Name of target column
            
        Returns:
            Dictionary of class weights
        """
        if target_col not in self.train_data.columns:
            return {}
        
        y = self.train_data[target_col]
        classes = np.unique(y)
        weights = compute_class_weight('balanced', classes=classes, y=y)
        
        return dict(zip(classes, weights))
    
    def get_quality_report(self) -> str:
        """
        Generate a comprehensive quality report.
        
        Returns:
            Formatted quality report string
        """
        report = []
        report.append("="*60)
        report.append("DATA PIPELINE QUALITY REPORT")
        report.append("="*60)
        
        report.append(f"\nSymbol: {self.quality_metrics.get('symbol', 'N/A')}")
        report.append(f"Date Range: {self.quality_metrics.get('date_range', 'N/A')}")
        
        report.append("\nData Volume:")
        report.append(f"  Rows fetched: {self.quality_metrics.get('rows_fetched', 'N/A')}")
        report.append(f"  Rows after cleaning: {self.quality_metrics.get('rows_after_cleaning', 'N/A')}")
        
        report.append("\nData Quality:")
        report.append(f"  Missing values filled: {self.quality_metrics.get('missing_filled', 'N/A')}")
        report.append(f"  Outliers clipped: {self.quality_metrics.get('outliers_clipped', 'N/A')}")
        
        report.append("\nTrain/Test Split:")
        report.append(f"  Training samples: {self.quality_metrics.get('train_size', 'N/A')}")
        report.append(f"  Test samples: {self.quality_metrics.get('test_size', 'N/A')}")
        report.append(f"  Gap (purge + embargo): {self.quality_metrics.get('gap_size', 'N/A')}")
        report.append(f"  Split date: {self.quality_metrics.get('split_date', 'N/A')}")
        
        report.append("\nProcessing Log:")
        for log_entry in self.processing_log:
            report.append(f"  - {log_entry}")
        
        return "\n".join(report)
    
    def run(self, symbol: str) -> 'MLDataPipeline':
        """
        Run the complete pipeline.
        
        Args:
            symbol: Ticker symbol to process
            
        Returns:
            Self with processed data
        """
        return self.fetch_data(symbol).clean_data().split_data()


# Run the complete pipeline
print("Running Complete Data Pipeline...")
print("="*60)

# Initialize with custom config
config = {
    'data': {'source': 'yfinance', 'period': '2y'},
    'cleaning': {'fill_method': 'ffill', 'outlier_std': 5.0},
    'splitting': {'test_size': 0.2, 'purge_days': 5, 'embargo_days': 2},
    'balance': {'strategy': 'balanced'}
}

pipeline = MLDataPipeline(config)
pipeline.run('SPY')

# Print quality report
print(pipeline.get_quality_report())

# Show sample of processed data
print("\n" + "="*60)
print("SAMPLE PROCESSED DATA:")
print("="*60)
print(pipeline.processed_data.tail())

Key Takeaways

  1. Data Quality Matters: Clean missing values appropriately for financial data (forward fill for prices, zero for volume)

  2. Never Shuffle Randomly: Time series data must maintain temporal order in train/test splits

  3. Purge and Embargo: When labels span multiple days, add gaps between train and test to prevent leakage

  4. Time Series CV: Use TimeSeriesSplit or custom walk-forward validation, never standard k-fold

  5. Handle Imbalance: Use class weights or sample weights to address imbalanced trading labels

  6. Document Everything: Keep a processing log to track all data transformations
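The purge-and-embargo takeaway condenses to a small helper; this is a minimal sketch assuming one row per trading day and a gap measured in rows:

```python
import numpy as np
import pandas as pd

def purged_split(df: pd.DataFrame, test_size: float = 0.2,
                 purge: int = 5, embargo: int = 1):
    """Time-ordered split that drops `purge + embargo` rows from the
    end of the training set, so labels spanning several days cannot
    overlap the test period."""
    split_idx = int(len(df) * (1 - test_size))
    train = df.iloc[:split_idx - (purge + embargo)]
    test = df.iloc[split_idx:]
    return train, test

# Toy check on 100 daily rows
dates = pd.date_range('2023-01-01', periods=100, freq='D')
df = pd.DataFrame({'x': np.arange(100)}, index=dates)
train, test = purged_split(df, test_size=0.2, purge=5, embargo=1)
print(len(train), len(test))  # prints: 74 20
```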


Next: Module 3 - Feature Engineering

Learn how to create predictive features from price data, technical indicators, and statistical measures.

Module 3: Feature Engineering

Part 1: ML Fundamentals for Finance

Duration Exercises
~2.5 hours 6

Learning Objectives

By the end of this module, you will be able to:

  • Create price-based features for ML models
  • Convert technical indicators into ML features
  • Build statistical features using rolling windows
  • Apply feature selection techniques

3.1 Price-Based Features

Financial ML features begin with simple transformations of raw price data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import yfinance as yf
import warnings
warnings.filterwarnings('ignore')

# Download sample data
print("Downloading data...")
df = yf.Ticker("SPY").history(period="2y")
print(f"Downloaded {len(df)} rows")
# Price-based features

def create_price_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create price-based features from OHLCV data.
    
    Args:
        df: DataFrame with OHLCV columns
        
    Returns:
        DataFrame with added features
    """
    features = df.copy()
    
    # Returns at multiple horizons
    for period in [1, 5, 10, 20]:
        features[f'return_{period}d'] = features['Close'].pct_change(period)
    
    # Log returns (more stable for ML)
    features['log_return'] = np.log(features['Close'] / features['Close'].shift(1))
    
    # Price ratios
    features['close_to_open'] = features['Close'] / features['Open'] - 1
    features['high_to_low'] = features['High'] / features['Low'] - 1
    features['close_to_high'] = features['Close'] / features['High']
    features['close_to_low'] = features['Close'] / features['Low']
    
    # Gap features
    features['overnight_gap'] = features['Open'] / features['Close'].shift(1) - 1
    
    # Volume features
    features['volume_change'] = features['Volume'].pct_change()
    features['volume_ma_ratio'] = features['Volume'] / features['Volume'].rolling(20).mean()
    
    return features

# Create features
df_features = create_price_features(df)

# Show new features
new_cols = [c for c in df_features.columns if c not in df.columns]
print(f"Created {len(new_cols)} price-based features:")
for col in new_cols:
    print(f"  - {col}")
# Volatility features

def create_volatility_features(df: pd.DataFrame, return_col: str = 'return_1d') -> pd.DataFrame:
    """
    Create volatility-based features.
    
    Args:
        df: DataFrame with returns
        return_col: Name of returns column
        
    Returns:
        DataFrame with volatility features
    """
    features = df.copy()
    
    # Historical volatility at different windows
    for window in [5, 10, 20, 60]:
        features[f'volatility_{window}d'] = features[return_col].rolling(window).std() * np.sqrt(252)
    
    # Volatility ratio (short-term vs long-term)
    features['vol_ratio_5_20'] = features['volatility_5d'] / features['volatility_20d']
    
    # Parkinson volatility (uses high/low)
    features['parkinson_vol'] = np.sqrt(
        (np.log(features['High'] / features['Low']) ** 2).rolling(20).mean() / (4 * np.log(2))
    ) * np.sqrt(252)
    
    # Garman-Klass volatility
    features['gk_vol'] = np.sqrt(
        (
            0.5 * (np.log(features['High'] / features['Low']) ** 2) -
            (2 * np.log(2) - 1) * (np.log(features['Close'] / features['Open']) ** 2)
        ).rolling(20).mean()
    ) * np.sqrt(252)
    
    return features

# Add volatility features
df_features = create_volatility_features(df_features)

vol_cols = [c for c in df_features.columns if 'vol' in c.lower()]
print(f"\nVolatility features: {vol_cols}")

Exercise 3.1: Price Feature Generator (Guided)

Build a comprehensive price feature generator.

Exercise
Solution 3.1
def generate_return_features(df: pd.DataFrame, horizons: list = None) -> pd.DataFrame:
    """
    Generate return features at multiple horizons.
    """
    if horizons is None:
        horizons = [1, 2, 5, 10, 20]

    features = df.copy()

    for h in horizons:
        # Calculate simple returns
        features[f'return_{h}d'] = features['Close'].pct_change(h)

        # Calculate log returns
        features[f'log_return_{h}d'] = np.log(features['Close'] / features['Close'].shift(h))

    return features

3.2 Technical Indicator Features

Converting traditional technical indicators into ML-ready features.

# Technical indicator features

def create_technical_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create technical indicator features.
    
    Args:
        df: DataFrame with OHLCV data
        
    Returns:
        DataFrame with technical features
    """
    features = df.copy()
    
    # Moving Averages
    for period in [5, 10, 20, 50, 200]:
        features[f'sma_{period}'] = features['Close'].rolling(period).mean()
        features[f'ema_{period}'] = features['Close'].ewm(span=period, adjust=False).mean()
        
        # Distance from MA (normalized)
        features[f'dist_sma_{period}'] = (features['Close'] - features[f'sma_{period}']) / features[f'sma_{period}']
    
    # RSI
    delta = features['Close'].diff()
    gain = delta.where(delta > 0, 0).rolling(14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
    rs = gain / loss
    features['rsi_14'] = 100 - (100 / (1 + rs))
    
    # RSI normalized to [-1, 1] for better ML performance
    features['rsi_normalized'] = (features['rsi_14'] - 50) / 50
    
    # MACD
    ema12 = features['Close'].ewm(span=12, adjust=False).mean()
    ema26 = features['Close'].ewm(span=26, adjust=False).mean()
    features['macd'] = ema12 - ema26
    features['macd_signal'] = features['macd'].ewm(span=9, adjust=False).mean()
    features['macd_hist'] = features['macd'] - features['macd_signal']
    
    # Normalize MACD by price
    features['macd_normalized'] = features['macd'] / features['Close']
    
    # Bollinger Bands
    sma20 = features['Close'].rolling(20).mean()
    std20 = features['Close'].rolling(20).std()
    features['bb_upper'] = sma20 + 2 * std20
    features['bb_lower'] = sma20 - 2 * std20
    features['bb_position'] = (features['Close'] - features['bb_lower']) / (features['bb_upper'] - features['bb_lower'])
    features['bb_width'] = (features['bb_upper'] - features['bb_lower']) / sma20
    
    # ATR
    high_low = features['High'] - features['Low']
    high_close = abs(features['High'] - features['Close'].shift())
    low_close = abs(features['Low'] - features['Close'].shift())
    true_range = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1)
    features['atr_14'] = true_range.rolling(14).mean()
    features['atr_normalized'] = features['atr_14'] / features['Close']
    
    return features

# Create technical features
df_features = create_technical_features(df)

# Show technical features
tech_cols = ['rsi_14', 'rsi_normalized', 'macd_normalized', 'bb_position', 'bb_width', 'atr_normalized']
print("Sample technical features:")
print(df_features[tech_cols].tail())
# Crossover and divergence features

def create_crossover_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create features based on indicator crossovers and divergences.
    
    Args:
        df: DataFrame with technical indicators
        
    Returns:
        DataFrame with crossover features
    """
    features = df.copy()
    
    # Create MAs if not present
    if 'sma_20' not in features.columns:
        features['sma_20'] = features['Close'].rolling(20).mean()
    if 'sma_50' not in features.columns:
        features['sma_50'] = features['Close'].rolling(50).mean()
    
    # Price above/below MA (binary)
    features['above_sma_20'] = (features['Close'] > features['sma_20']).astype(int)
    features['above_sma_50'] = (features['Close'] > features['sma_50']).astype(int)
    
    # Days since last crossover
    cross_20 = features['above_sma_20'].diff().abs()
    features['days_since_sma20_cross'] = cross_20.groupby((cross_20 == 1).cumsum()).cumcount()
    
    # MA crossovers
    features['golden_cross'] = (
        (features['sma_20'] > features['sma_50']) & 
        (features['sma_20'].shift(1) <= features['sma_50'].shift(1))
    ).astype(int)
    
    features['death_cross'] = (
        (features['sma_20'] < features['sma_50']) & 
        (features['sma_20'].shift(1) >= features['sma_50'].shift(1))
    ).astype(int)
    
    # RSI oversold/overbought
    if 'rsi_14' in features.columns:
        features['rsi_oversold'] = (features['rsi_14'] < 30).astype(int)
        features['rsi_overbought'] = (features['rsi_14'] > 70).astype(int)
    
    return features

# Add crossover features
df_features = create_crossover_features(df_features)

cross_cols = ['above_sma_20', 'above_sma_50', 'golden_cross', 'death_cross']
print("Crossover features:")
print(df_features[cross_cols].tail(10))

Exercise 3.2: Technical Feature Builder (Guided)

Create an RSI feature with multiple periods and normalizations.

Solution 3.2
def build_rsi_features(df: pd.DataFrame, periods: list = None) -> pd.DataFrame:
    """
    Build RSI features at multiple periods.
    """
    if periods is None:
        periods = [7, 14, 21]

    features = df.copy()

    for period in periods:
        # Calculate price changes
        delta = features['Close'].diff()

        # Separate gains and losses
        gain = delta.where(delta > 0, 0).rolling(period).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(period).mean()

        # Calculate RSI
        rs = gain / loss
        features[f'rsi_{period}'] = 100 - (100 / (1 + rs))

        # Normalized version
        features[f'rsi_{period}_norm'] = (features[f'rsi_{period}'] - 50) / 50

    return features

3.3 Statistical Features

Statistical transformations that help ML models understand data distribution.

# Statistical features

def create_statistical_features(df: pd.DataFrame, windows: list = None) -> pd.DataFrame:
    """
    Create statistical features using rolling windows.
    
    Args:
        df: DataFrame with price data
        windows: List of window sizes
        
    Returns:
        DataFrame with statistical features
    """
    if windows is None:
        windows = [5, 10, 20]
    
    features = df.copy()
    
    # Ensure returns exist
    if 'returns' not in features.columns:
        features['returns'] = features['Close'].pct_change()
    
    for window in windows:
        # Rolling statistics
        features[f'rolling_mean_{window}'] = features['returns'].rolling(window).mean()
        features[f'rolling_std_{window}'] = features['returns'].rolling(window).std()
        features[f'rolling_min_{window}'] = features['returns'].rolling(window).min()
        features[f'rolling_max_{window}'] = features['returns'].rolling(window).max()
        
        # Z-score of returns
        mean = features['returns'].rolling(window).mean()
        std = features['returns'].rolling(window).std()
        features[f'zscore_{window}'] = (features['returns'] - mean) / std
        
        # Percentile rank
        features[f'percentile_rank_{window}'] = features['returns'].rolling(window).apply(
            lambda x: (x[-1] > x[:-1]).mean() if len(x) > 1 else 0.5,
            raw=True
        )
        
        # Skewness and kurtosis
        features[f'skew_{window}'] = features['returns'].rolling(window).skew()
        features[f'kurtosis_{window}'] = features['returns'].rolling(window).kurt()
    
    return features

# Add statistical features (building on the technical/crossover features)
df_features = create_statistical_features(df_features)

stat_cols = ['zscore_20', 'percentile_rank_20', 'skew_20', 'kurtosis_20']
print("Statistical features:")
print(df_features[stat_cols].dropna().tail())
# Autocorrelation features

def create_autocorrelation_features(df: pd.DataFrame, lags: list = None, window: int = 60) -> pd.DataFrame:
    """
    Create autocorrelation features.
    
    These capture momentum/mean-reversion patterns.
    
    Args:
        df: DataFrame with returns
        lags: List of lags to compute
        window: Rolling window size
        
    Returns:
        DataFrame with autocorrelation features
    """
    if lags is None:
        lags = [1, 2, 5, 10]
    
    features = df.copy()
    
    if 'returns' not in features.columns:
        features['returns'] = features['Close'].pct_change()
    
    for lag in lags:
        # Rolling autocorrelation
        features[f'autocorr_lag{lag}'] = features['returns'].rolling(window).apply(
            lambda x: x.autocorr(lag=lag) if len(x) > lag else np.nan,
            raw=False
        )
    
    return features

# Create autocorrelation features
df_autocorr = create_autocorrelation_features(df_features[['Close', 'returns']].copy())

autocorr_cols = [c for c in df_autocorr.columns if 'autocorr' in c]
print("Autocorrelation features:")
print(df_autocorr[autocorr_cols].dropna().tail())

Exercise 3.3: Z-Score Feature Builder (Guided)

Build z-score features for multiple columns.

Solution 3.3
def create_zscore_features(df: pd.DataFrame, columns: list, window: int = 20) -> pd.DataFrame:
    """
    Create z-score normalized versions of columns.
    """
    features = df.copy()

    for col in columns:
        if col not in features.columns:
            continue

        # Calculate rolling mean
        roll_mean = features[col].rolling(window).mean()

        # Calculate rolling standard deviation
        roll_std = features[col].rolling(window).std()

        # Calculate z-score
        features[f'{col}_zscore'] = (features[col] - roll_mean) / roll_std

        # Clip extreme values for stability
        features[f'{col}_zscore'] = features[f'{col}_zscore'].clip(-3, 3)

    return features

3.4 Feature Selection

Selecting the most predictive features and removing redundant ones.

# Feature correlation analysis

def analyze_feature_correlations(df: pd.DataFrame, feature_cols: list, threshold: float = 0.8) -> dict:
    """
    Analyze correlations between features.
    
    Args:
        df: DataFrame with features
        feature_cols: List of feature columns
        threshold: Correlation threshold for "high" correlation
        
    Returns:
        Dictionary with correlation analysis
    """
    # Calculate correlation matrix
    corr_matrix = df[feature_cols].corr().abs()
    
    # Find highly correlated pairs
    high_corr_pairs = []
    for i in range(len(feature_cols)):
        for j in range(i + 1, len(feature_cols)):
            if corr_matrix.iloc[i, j] >= threshold:
                high_corr_pairs.append({
                    'feature_1': feature_cols[i],
                    'feature_2': feature_cols[j],
                    'correlation': corr_matrix.iloc[i, j]
                })
    
    return {
        'correlation_matrix': corr_matrix,
        'high_correlation_pairs': high_corr_pairs,
        'n_high_corr': len(high_corr_pairs)
    }

# Create a feature set and analyze
df_full = create_price_features(df)
df_full = create_technical_features(df_full)
df_full = df_full.dropna()

# Select numeric feature columns
feature_cols = ['return_1d', 'return_5d', 'return_20d', 'volatility_5d', 'volatility_20d',
                'rsi_normalized', 'macd_normalized', 'bb_position', 'atr_normalized']
feature_cols = [c for c in feature_cols if c in df_full.columns]

corr_analysis = analyze_feature_correlations(df_full, feature_cols)

print(f"High correlation pairs (>0.8): {corr_analysis['n_high_corr']}")
for pair in corr_analysis['high_correlation_pairs'][:5]:
    print(f"  {pair['feature_1']} <-> {pair['feature_2']}: {pair['correlation']:.3f}")
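A natural next step after spotting high-correlation pairs is to prune one feature from each. Below is a minimal sketch; the `drop_correlated_features` helper and the synthetic demo data are illustrative, not part of the course library:

```python
import numpy as np
import pandas as pd

def drop_correlated_features(df: pd.DataFrame, feature_cols: list, threshold: float = 0.8) -> list:
    """Greedily drop the later feature of each highly correlated pair."""
    corr = df[feature_cols].corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return [c for c in feature_cols if c not in to_drop]

# Synthetic demo: 'b' is a near-copy of 'a', 'c' is independent
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    'a': base,
    'b': base + rng.normal(scale=0.01, size=500),
    'c': rng.normal(size=500),
})
kept = drop_correlated_features(demo, ['a', 'b', 'c'])
print(kept)  # ['a', 'c'] — 'b' is dropped as redundant
```

In practice, prefer keeping whichever feature of a pair has the higher standalone importance, rather than dropping purely by column order.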
# Feature importance with Random Forest

from sklearn.ensemble import RandomForestClassifier

def get_feature_importance(df: pd.DataFrame, feature_cols: list, target_col: str) -> pd.DataFrame:
    """
    Calculate feature importance using Random Forest.
    
    Args:
        df: DataFrame with features and target
        feature_cols: List of feature columns
        target_col: Name of target column
        
    Returns:
        DataFrame with feature importances
    """
    # Prepare data
    df_clean = df[feature_cols + [target_col]].dropna()
    X = df_clean[feature_cols]
    y = df_clean[target_col]
    
    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
    model.fit(X, y)
    
    # Get importances
    importance_df = pd.DataFrame({
        'feature': feature_cols,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    return importance_df

# Create target: 1 if the next day's return is positive
# (note: the final row has no next-day return, so its label defaults to 0)
df_full['target'] = (df_full['Close'].pct_change().shift(-1) > 0).astype(int)

# Get feature importances
importance_df = get_feature_importance(df_full, feature_cols, 'target')

print("Feature Importances (Random Forest):")
print(importance_df)
# Recursive feature elimination

from sklearn.feature_selection import RFE

def select_features_rfe(df: pd.DataFrame, feature_cols: list, target_col: str, n_features: int = 5) -> list:
    """
    Select top features using Recursive Feature Elimination.
    
    Args:
        df: DataFrame with features and target
        feature_cols: List of feature columns
        target_col: Name of target column
        n_features: Number of features to select
        
    Returns:
        List of selected feature names
    """
    # Prepare data
    df_clean = df[feature_cols + [target_col]].dropna()
    X = df_clean[feature_cols]
    y = df_clean[target_col]
    
    # RFE with Random Forest
    model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)
    rfe = RFE(estimator=model, n_features_to_select=n_features, step=1)
    rfe.fit(X, y)
    
    # Get selected features
    selected = [f for f, s in zip(feature_cols, rfe.support_) if s]
    
    return selected

# Select top 5 features
selected_features = select_features_rfe(df_full, feature_cols, 'target', n_features=5)

print("\nTop 5 features (RFE):")
for i, feat in enumerate(selected_features, 1):
    print(f"  {i}. {feat}")
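Mutual information is another common ranking criterion; unlike linear correlation, it also captures non-linear dependence between a feature and the target. A brief sketch using scikit-learn's `mutual_info_classif` on synthetic data (the helper name and demo data are illustrative, not part of the course code):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def rank_features_mutual_info(df: pd.DataFrame, feature_cols: list, target_col: str,
                              random_state: int = 42) -> pd.Series:
    """Rank features by estimated mutual information with a classification target."""
    clean = df[feature_cols + [target_col]].dropna()
    mi = mutual_info_classif(clean[feature_cols], clean[target_col], random_state=random_state)
    return pd.Series(mi, index=feature_cols).sort_values(ascending=False)

# Synthetic demo: one informative feature, one pure-noise feature
rng = np.random.default_rng(1)
signal = rng.normal(size=1000)
demo = pd.DataFrame({
    'informative': signal,
    'noise': rng.normal(size=1000),
    'target': (signal + rng.normal(scale=0.5, size=1000) > 0).astype(int),
})
ranking = rank_features_mutual_info(demo, ['informative', 'noise'], 'target', random_state=0)
print(ranking)  # 'informative' should rank well above 'noise'
```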

Open-Ended Exercises

Exercise 3.4: Complete Feature Library (Open-ended)

Build a comprehensive feature engineering library.

Solution 3.4
class FeatureLibrary:
    """
    Comprehensive feature engineering library for financial ML.
    """

    def __init__(self, df: pd.DataFrame):
        self.original_df = df.copy()
        self.features_df = df.copy()
        self.feature_names = []
        self.feature_groups = {}

    def add_price_features(self, horizons: list = None) -> 'FeatureLibrary':
        """Add price-based features."""
        if horizons is None:
            horizons = [1, 5, 10, 20]

        new_features = []

        for h in horizons:
            col = f'return_{h}d'
            self.features_df[col] = self.features_df['Close'].pct_change(h)
            new_features.append(col)

        # Volatility (needs daily returns; guard in case 1 is not in horizons)
        if 'return_1d' not in self.features_df.columns:
            self.features_df['return_1d'] = self.features_df['Close'].pct_change()

        for w in [5, 20]:
            col = f'volatility_{w}d'
            self.features_df[col] = self.features_df['return_1d'].rolling(w).std() * np.sqrt(252)
            new_features.append(col)

        self.feature_names.extend(new_features)
        self.feature_groups['price'] = new_features

        return self

    def add_technical_features(self) -> 'FeatureLibrary':
        """Add technical indicator features."""
        new_features = []

        # RSI
        delta = self.features_df['Close'].diff()
        gain = delta.where(delta > 0, 0).rolling(14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
        self.features_df['rsi_14'] = 100 - (100 / (1 + gain / loss))
        self.features_df['rsi_normalized'] = (self.features_df['rsi_14'] - 50) / 50
        new_features.extend(['rsi_14', 'rsi_normalized'])

        # MACD
        ema12 = self.features_df['Close'].ewm(span=12).mean()
        ema26 = self.features_df['Close'].ewm(span=26).mean()
        self.features_df['macd_normalized'] = (ema12 - ema26) / self.features_df['Close']
        new_features.append('macd_normalized')

        # Bollinger position
        sma20 = self.features_df['Close'].rolling(20).mean()
        std20 = self.features_df['Close'].rolling(20).std()
        self.features_df['bb_position'] = (
            (self.features_df['Close'] - (sma20 - 2*std20)) / (4*std20)
        )
        new_features.append('bb_position')

        self.feature_names.extend(new_features)
        self.feature_groups['technical'] = new_features

        return self

    def add_statistical_features(self, window: int = 20) -> 'FeatureLibrary':
        """Add statistical features."""
        new_features = []

        if 'return_1d' not in self.features_df.columns:
            self.features_df['return_1d'] = self.features_df['Close'].pct_change()

        # Z-score
        mean = self.features_df['return_1d'].rolling(window).mean()
        std = self.features_df['return_1d'].rolling(window).std()
        self.features_df['return_zscore'] = (self.features_df['return_1d'] - mean) / std
        self.features_df['return_zscore'] = self.features_df['return_zscore'].clip(-3, 3)
        new_features.append('return_zscore')

        # Skewness and kurtosis
        self.features_df['skew_20'] = self.features_df['return_1d'].rolling(window).skew()
        self.features_df['kurtosis_20'] = self.features_df['return_1d'].rolling(window).kurt()
        new_features.extend(['skew_20', 'kurtosis_20'])

        self.feature_names.extend(new_features)
        self.feature_groups['statistical'] = new_features

        return self

    def select_features(self, target: pd.Series, n_features: int = 10) -> list:
        """Select top features using importance."""
        df_clean = self.features_df[self.feature_names].dropna()
        target_clean = target.loc[df_clean.index]

        model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)
        model.fit(df_clean, target_clean)

        importance = pd.Series(model.feature_importances_, index=self.feature_names)
        return importance.nlargest(n_features).index.tolist()

    def get_feature_matrix(self, feature_list: list = None) -> pd.DataFrame:
        """Return clean feature matrix."""
        if feature_list is None:
            feature_list = self.feature_names
        return self.features_df[feature_list].dropna()

    def get_summary(self) -> str:
        """Return feature library summary."""
        summary = "Feature Library Summary\n" + "=" * 40 + "\n"
        summary += f"Total features: {len(self.feature_names)}\n"
        for group, features in self.feature_groups.items():
            summary += f"  {group}: {len(features)} features\n"
        return summary

# Test
library = FeatureLibrary(df)
library.add_price_features().add_technical_features().add_statistical_features()

print(library.get_summary())
print(f"\nFeature matrix shape: {library.get_feature_matrix().shape}")

Exercise 3.5: Multi-Timeframe Features (Open-ended)

Create features that combine information from multiple timeframes.

Solution 3.5
def create_multi_timeframe_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create features combining multiple timeframes.

    Args:
        df: DataFrame with daily OHLCV data

    Returns:
        DataFrame with multi-timeframe features
    """
    features = df.copy()

    # Daily features (1-5 days)
    features['momentum_daily'] = features['Close'].pct_change(5)
    features['trend_daily'] = (features['Close'] > features['Close'].rolling(5).mean()).astype(int)

    # Weekly features (5-20 days)
    features['momentum_weekly'] = features['Close'].pct_change(20)
    features['trend_weekly'] = (features['Close'] > features['Close'].rolling(20).mean()).astype(int)

    # Monthly features (20-60 days)
    features['momentum_monthly'] = features['Close'].pct_change(60)
    features['trend_monthly'] = (features['Close'] > features['Close'].rolling(60).mean()).astype(int)

    # Timeframe ratios (denominator floored at 0.1% to avoid division blow-ups)
    features['momentum_ratio_dw'] = features['momentum_daily'] / features['momentum_weekly'].abs().clip(lower=0.001)
    features['momentum_ratio_wm'] = features['momentum_weekly'] / features['momentum_monthly'].abs().clip(lower=0.001)

    # Trend alignment
    features['trend_alignment'] = (
        features['trend_daily'] + features['trend_weekly'] + features['trend_monthly']
    ) / 3  # 0 = all bearish, 1 = all bullish

    # Divergence signals
    features['daily_weekly_divergence'] = (
        (features['trend_daily'] == 1) & (features['trend_weekly'] == 0)
    ).astype(int)

    # Volatility across timeframes
    features['vol_5d'] = features['Close'].pct_change().rolling(5).std() * np.sqrt(252)
    features['vol_20d'] = features['Close'].pct_change().rolling(20).std() * np.sqrt(252)
    features['vol_ratio'] = features['vol_5d'] / features['vol_20d']

    return features

# Test
df_mtf = create_multi_timeframe_features(df)
mtf_cols = ['momentum_daily', 'momentum_weekly', 'momentum_monthly', 
            'trend_alignment', 'vol_ratio']
print("Multi-timeframe features:")
print(df_mtf[mtf_cols].dropna().tail())

Exercise 3.6: Feature Pipeline Builder (Open-ended)

Create a complete feature engineering pipeline that's ready for production.

Solution 3.6
import pickle
from sklearn.preprocessing import StandardScaler

class FeaturePipeline:
    """
    Production-ready feature engineering pipeline.
    """

    def __init__(self, config: dict = None):
        self.config = config or self._default_config()
        self.scaler = StandardScaler()
        self.selected_features = None
        self.feature_stats = {}
        self.is_fitted = False

    def _default_config(self) -> dict:
        return {
            'return_horizons': [1, 5, 10, 20],
            'volatility_windows': [5, 20],
            'rsi_period': 14,
            'macd_params': (12, 26, 9),
            'zscore_window': 20,
            'n_features': 10
        }

    def _create_all_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create all features from raw data."""
        features = df.copy()

        # Returns
        for h in self.config['return_horizons']:
            features[f'return_{h}d'] = features['Close'].pct_change(h)

        # Volatility (needs daily returns; guard in case 1 is not in return_horizons)
        if 'return_1d' not in features.columns:
            features['return_1d'] = features['Close'].pct_change()

        for w in self.config['volatility_windows']:
            features[f'vol_{w}d'] = features['return_1d'].rolling(w).std() * np.sqrt(252)

        # RSI
        period = self.config['rsi_period']
        delta = features['Close'].diff()
        gain = delta.where(delta > 0, 0).rolling(period).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(period).mean()
        features['rsi'] = 100 - (100 / (1 + gain / loss))
        features['rsi_norm'] = (features['rsi'] - 50) / 50

        # MACD
        fast, slow, signal = self.config['macd_params']
        ema_fast = features['Close'].ewm(span=fast).mean()
        ema_slow = features['Close'].ewm(span=slow).mean()
        features['macd_norm'] = (ema_fast - ema_slow) / features['Close']

        # Z-score
        window = self.config['zscore_window']
        mean = features['return_1d'].rolling(window).mean()
        std = features['return_1d'].rolling(window).std()
        features['return_zscore'] = ((features['return_1d'] - mean) / std).clip(-3, 3)

        return features

    def fit(self, df: pd.DataFrame, target: pd.Series) -> 'FeaturePipeline':
        """Fit the pipeline on training data."""
        # Create features
        features_df = self._create_all_features(df)

        # Get feature columns
        feature_cols = [c for c in features_df.columns 
                       if c not in ['Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits']]

        # Clean data
        clean_idx = features_df[feature_cols].dropna().index
        X_clean = features_df.loc[clean_idx, feature_cols]
        y_clean = target.loc[clean_idx]

        # Feature selection
        model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)
        model.fit(X_clean, y_clean)
        importance = pd.Series(model.feature_importances_, index=feature_cols)
        self.selected_features = importance.nlargest(self.config['n_features']).index.tolist()

        # Fit scaler
        self.scaler.fit(X_clean[self.selected_features])

        # Store statistics
        self.feature_stats = {
            'means': X_clean[self.selected_features].mean().to_dict(),
            'stds': X_clean[self.selected_features].std().to_dict()
        }

        self.is_fitted = True
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Transform new data using fitted pipeline."""
        if not self.is_fitted:
            raise ValueError("Pipeline not fitted. Call fit() first.")

        features_df = self._create_all_features(df)
        X = features_df[self.selected_features]
        X_scaled = pd.DataFrame(
            self.scaler.transform(X),
            index=X.index,
            columns=self.selected_features
        )
        return X_scaled

    def fit_transform(self, df: pd.DataFrame, target: pd.Series) -> pd.DataFrame:
        """Fit and transform in one step."""
        self.fit(df, target)
        return self.transform(df)

    def save(self, filepath: str):
        """Save pipeline to file."""
        with open(filepath, 'wb') as f:
            pickle.dump(self, f)

    @classmethod
    def load(cls, filepath: str) -> 'FeaturePipeline':
        """Load pipeline from file."""
        with open(filepath, 'rb') as f:
            return pickle.load(f)

# Test
target = (df['Close'].pct_change().shift(-1) > 0).astype(int)
pipeline = FeaturePipeline()
X = pipeline.fit_transform(df, target)

print(f"Selected features: {pipeline.selected_features}")
print(f"\nTransformed shape: {X.shape}")
print(X.dropna().tail())

Module Project: Feature Engineering Library

Build a comprehensive, reusable feature engineering library.

# Module Project: Feature Engineering Library

import pandas as pd
import numpy as np
from typing import Dict, List, Optional
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler


class FinancialFeatureEngine:
    """
    Comprehensive feature engineering library for financial ML.
    
    This library provides a complete suite of features commonly used
    in quantitative trading and financial machine learning.
    """
    
    def __init__(self):
        self.features_df = None
        self.feature_catalog = {}
        self.scaler = StandardScaler()
    
    def fit(self, df: pd.DataFrame) -> 'FinancialFeatureEngine':
        """
        Initialize the feature engine with data.
        
        Args:
            df: DataFrame with OHLCV data
            
        Returns:
            Self for method chaining
        """
        self.features_df = df.copy()
        return self
    
    def add_returns(self, periods: List[int] = None) -> 'FinancialFeatureEngine':
        """
        Add return features at multiple horizons.
        
        Args:
            periods: List of periods for returns
        """
        if periods is None:
            periods = [1, 2, 5, 10, 20]
        
        features = []
        for p in periods:
            col = f'return_{p}d'
            self.features_df[col] = self.features_df['Close'].pct_change(p)
            features.append(col)
            
            # Log returns
            col_log = f'log_return_{p}d'
            self.features_df[col_log] = np.log(
                self.features_df['Close'] / self.features_df['Close'].shift(p)
            )
            features.append(col_log)
        
        self.feature_catalog['returns'] = features
        return self
    
    def add_volatility(self, windows: List[int] = None) -> 'FinancialFeatureEngine':
        """
        Add volatility features.
        
        Args:
            windows: List of rolling window sizes
        """
        if windows is None:
            windows = [5, 10, 20, 60]
        
        # Ensure daily returns exist
        if 'return_1d' not in self.features_df.columns:
            self.features_df['return_1d'] = self.features_df['Close'].pct_change()
        
        features = []
        for w in windows:
            col = f'volatility_{w}d'
            self.features_df[col] = (
                self.features_df['return_1d'].rolling(w).std() * np.sqrt(252)
            )
            features.append(col)
        
        # Volatility ratios
        if 'volatility_5d' in self.features_df.columns and 'volatility_20d' in self.features_df.columns:
            self.features_df['vol_ratio_5_20'] = (
                self.features_df['volatility_5d'] / self.features_df['volatility_20d']
            )
            features.append('vol_ratio_5_20')
        
        self.feature_catalog['volatility'] = features
        return self
    
    def add_momentum(self) -> 'FinancialFeatureEngine':
        """
        Add momentum indicators.
        """
        features = []
        
        # RSI
        delta = self.features_df['Close'].diff()
        gain = delta.where(delta > 0, 0).rolling(14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
        self.features_df['rsi_14'] = 100 - (100 / (1 + gain / loss))
        self.features_df['rsi_normalized'] = (self.features_df['rsi_14'] - 50) / 50
        features.extend(['rsi_14', 'rsi_normalized'])
        
        # MACD
        ema12 = self.features_df['Close'].ewm(span=12).mean()
        ema26 = self.features_df['Close'].ewm(span=26).mean()
        self.features_df['macd'] = ema12 - ema26
        self.features_df['macd_signal'] = self.features_df['macd'].ewm(span=9).mean()
        self.features_df['macd_normalized'] = self.features_df['macd'] / self.features_df['Close']
        features.append('macd_normalized')
        
        # Stochastic
        low_14 = self.features_df['Low'].rolling(14).min()
        high_14 = self.features_df['High'].rolling(14).max()
        self.features_df['stoch_k'] = (
            (self.features_df['Close'] - low_14) / (high_14 - low_14) * 100
        )
        self.features_df['stoch_k_normalized'] = (self.features_df['stoch_k'] - 50) / 50
        features.append('stoch_k_normalized')
        
        self.feature_catalog['momentum'] = features
        return self
    
    def add_trend(self) -> 'FinancialFeatureEngine':
        """
        Add trend-following features.
        """
        features = []
        
        # Distance from moving averages
        for period in [20, 50, 200]:
            sma = self.features_df['Close'].rolling(period).mean()
            col = f'dist_sma_{period}'
            self.features_df[col] = (self.features_df['Close'] - sma) / sma
            features.append(col)
        
        # Trend direction
        self.features_df['above_sma_20'] = (
            self.features_df['Close'] > self.features_df['Close'].rolling(20).mean()
        ).astype(int)
        self.features_df['above_sma_50'] = (
            self.features_df['Close'] > self.features_df['Close'].rolling(50).mean()
        ).astype(int)
        features.extend(['above_sma_20', 'above_sma_50'])
        
        self.feature_catalog['trend'] = features
        return self
    
    def add_volume(self) -> 'FinancialFeatureEngine':
        """
        Add volume-based features.
        """
        features = []
        
        # Volume change
        self.features_df['volume_change'] = self.features_df['Volume'].pct_change()
        features.append('volume_change')
        
        # Volume relative to average
        self.features_df['volume_ma_ratio'] = (
            self.features_df['Volume'] / self.features_df['Volume'].rolling(20).mean()
        )
        features.append('volume_ma_ratio')
        
        # Volume z-score
        vol_mean = self.features_df['Volume'].rolling(20).mean()
        vol_std = self.features_df['Volume'].rolling(20).std()
        self.features_df['volume_zscore'] = (
            (self.features_df['Volume'] - vol_mean) / vol_std
        ).clip(-3, 3)
        features.append('volume_zscore')
        
        self.feature_catalog['volume'] = features
        return self
    
    def add_statistical(self, window: int = 20) -> 'FinancialFeatureEngine':
        """
        Add statistical features.
        
        Args:
            window: Rolling window size
        """
        features = []
        
        if 'return_1d' not in self.features_df.columns:
            self.features_df['return_1d'] = self.features_df['Close'].pct_change()
        
        # Z-score of returns
        mean = self.features_df['return_1d'].rolling(window).mean()
        std = self.features_df['return_1d'].rolling(window).std()
        self.features_df['return_zscore'] = (
            (self.features_df['return_1d'] - mean) / std
        ).clip(-3, 3)
        features.append('return_zscore')
        
        # Skewness and kurtosis
        self.features_df['skew'] = self.features_df['return_1d'].rolling(window).skew()
        self.features_df['kurtosis'] = self.features_df['return_1d'].rolling(window).kurt()
        features.extend(['skew', 'kurtosis'])
        
        self.feature_catalog['statistical'] = features
        return self
    
    def build_all(self) -> 'FinancialFeatureEngine':
        """
        Build all available features.
        """
        return (
            self.add_returns()
                .add_volatility()
                .add_momentum()
                .add_trend()
                .add_volume()
                .add_statistical()
        )
    
    def get_feature_names(self, groups: List[str] = None) -> List[str]:
        """
        Get list of feature names.
        
        Args:
            groups: Optional list of feature groups to include
            
        Returns:
            List of feature names
        """
        if groups is None:
            groups = list(self.feature_catalog.keys())
        
        features = []
        for group in groups:
            if group in self.feature_catalog:
                features.extend(self.feature_catalog[group])
        
        return features
    
    def get_features(self, groups: List[str] = None, dropna: bool = True) -> pd.DataFrame:
        """
        Get feature DataFrame.
        
        Args:
            groups: Optional list of feature groups to include
            dropna: Whether to drop rows with missing values
            
        Returns:
            DataFrame with selected features
        """
        feature_names = self.get_feature_names(groups)
        df = self.features_df[feature_names]
        
        if dropna:
            df = df.dropna()
        
        return df
    
    def get_summary(self) -> str:
        """
        Get a summary of all features.
        
        Returns:
            Formatted summary string
        """
        summary = ["Financial Feature Engine Summary", "="*50]
        
        total = 0
        for group, features in self.feature_catalog.items():
            summary.append(f"\n{group.upper()} ({len(features)} features):")
            for f in features:
                summary.append(f"  - {f}")
            total += len(features)
        
        summary.append(f"\nTOTAL: {total} features")
        
        return "\n".join(summary)


# Demo the feature engine
print("Building Financial Feature Engine...")
print("="*60)

# Initialize and build features
engine = FinancialFeatureEngine()
engine.fit(df).build_all()

# Print summary
print(engine.get_summary())

# Get feature matrix
features = engine.get_features()
print(f"\nFeature matrix shape: {features.shape}")
print("\nSample features:")
print(features.tail())

Key Takeaways

  1. Normalize Features: Convert raw indicators to normalized forms (z-scores, percentages) for better ML performance

  2. Multiple Horizons: Create features at different time scales (1, 5, 20, 60 days) to capture patterns at various frequencies

  3. Feature Types: Combine price-based, technical, and statistical features for comprehensive coverage

  4. Handle Correlations: Remove highly correlated features to reduce redundancy and overfitting

  5. Feature Selection: Use importance scores or RFE to identify the most predictive features

  6. Clip Outliers: Z-scores and other features should be clipped (e.g., ±3) for stability


Next: Module 4 - Target Engineering

Learn how to define prediction targets using the triple barrier method, meta-labeling, and proper handling of overlapping labels.

Module 4: Target Engineering

Part 1: ML Fundamentals for Finance

Duration Exercises
~2.5 hours 6

Learning Objectives

By the end of this module, you will be able to:

  • Define effective prediction targets for trading
  • Implement the triple barrier method for labeling
  • Apply meta-labeling for signal filtering
  • Avoid lookahead bias in target creation

4.1 Defining Targets

The target (what we predict) is just as important as the features.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import yfinance as yf
import warnings
warnings.filterwarnings('ignore')

# Download sample data
print("Downloading data...")
df = yf.Ticker("SPY").history(period="2y")
print(f"Downloaded {len(df)} rows")
# Different types of targets

print("Types of Prediction Targets")
print("="*50)

target_types = {
    'Direction': {
        'description': 'Up or down (binary classification)',
        'example': '1 if tomorrow\'s return > 0, else 0',
        'pros': 'Simple, clear signal',
        'cons': 'Ignores magnitude, many small movements'
    },
    'Return Magnitude': {
        'description': 'Actual return value (regression)',
        'example': 'Tomorrow\'s return = 0.5%',
        'pros': 'More information, enables position sizing',
        'cons': 'Harder to predict, noisy'
    },
    'Multi-class Direction': {
        'description': 'Strong up, weak up, neutral, weak down, strong down',
        'example': '0: < -1%, 1: -1% to 0%, 2: 0% to 1%, 3: > 1%',
        'pros': 'More nuanced than binary',
        'cons': 'Class imbalance, harder to train'
    },
    'Triple Barrier': {
        'description': 'First barrier hit: profit, loss, or time',
        'example': 'Label based on which exit occurs first',
        'pros': 'Most realistic for trading',
        'cons': 'More complex to implement'
    }
}

for target, details in target_types.items():
    print(f"\n{target}:")
    print(f"  Description: {details['description']}")
    print(f"  Example: {details['example']}")
    print(f"  Pros: {details['pros']}")
    print(f"  Cons: {details['cons']}")
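The "many small movements" drawback of binary direction labels is easy to quantify: count how many labeled days move less than trading costs. A standalone sketch on synthetic returns (the 0.1% round-trip cost is an assumed figure for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily returns: zero drift, ~1% daily volatility
returns = rng.normal(loc=0.0, scale=0.01, size=2000)

# Binary direction treats every positive day as "up"...
labels = (returns > 0).astype(int)

# ...but many moves are smaller than an assumed 0.1% round-trip cost
cost = 0.001
tiny = np.abs(returns) < cost

print(f"Up days: {labels.mean():.1%}")
print(f"Moves smaller than cost: {tiny.mean():.1%}")
```

Those tiny-move days carry a label but no tradeable edge, which is one motivation for the threshold and triple barrier methods below.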
# Simple direction target

def create_direction_target(df: pd.DataFrame, horizon: int = 1) -> pd.Series:
    """
    Create a simple binary direction target.
    
    Args:
        df: DataFrame with 'Close' column
        horizon: Number of days to look ahead
        
    Returns:
        Series with binary labels (1 = up, 0 = down; NaN where the future return is unknown)
    """
    future_return = df['Close'].pct_change(horizon).shift(-horizon)
    # Mask the last `horizon` rows so unknown future returns
    # don't silently become "down" (0) labels
    target = (future_return > 0).astype(int)
    return target.where(future_return.notna())

# Create targets at different horizons
df['target_1d'] = create_direction_target(df, horizon=1)
df['target_5d'] = create_direction_target(df, horizon=5)
df['target_20d'] = create_direction_target(df, horizon=20)

# Check class balance
print("Class Balance by Horizon:")
for col in ['target_1d', 'target_5d', 'target_20d']:
    pct_up = df[col].mean() * 100
    print(f"  {col}: {pct_up:.1f}% up, {100-pct_up:.1f}% down")
# Multi-class target based on return magnitude

def create_multiclass_target(df: pd.DataFrame, horizon: int = 1, thresholds: list = None) -> pd.Series:
    """
    Create multi-class target based on return magnitude.
    
    Args:
        df: DataFrame with 'Close' column
        horizon: Number of days to look ahead
        thresholds: Return thresholds for classes
        
    Returns:
        Series with multi-class labels
    """
    if thresholds is None:
        thresholds = [-0.02, -0.005, 0.005, 0.02]  # -2%, -0.5%, 0.5%, 2%
    
    future_return = df['Close'].pct_change(horizon).shift(-horizon)
    
    # Create labels: 0 (strong down), 1 (weak down), 2 (neutral), 3 (weak up), 4 (strong up)
    conditions = [
        future_return <= thresholds[0],
        (future_return > thresholds[0]) & (future_return <= thresholds[1]),
        (future_return > thresholds[1]) & (future_return <= thresholds[2]),
        (future_return > thresholds[2]) & (future_return <= thresholds[3]),
        future_return > thresholds[3]
    ]
    labels = [0, 1, 2, 3, 4]
    
    target = np.select(conditions, labels, default=2)
    target = pd.Series(target, index=df.index, dtype=float)
    # Leave the last `horizon` rows unlabeled instead of defaulting them to neutral
    return target.where(future_return.notna())

# Create multi-class target
df['target_multiclass'] = create_multiclass_target(df, horizon=5)

print("\nMulti-class Target Distribution:")
label_names = ['Strong Down', 'Weak Down', 'Neutral', 'Weak Up', 'Strong Up']
for label, name in enumerate(label_names):
    count = (df['target_multiclass'] == label).sum()
    pct = count / df['target_multiclass'].notna().sum() * 100
    print(f"  {label} ({name}): {count} ({pct:.1f}%)")

Exercise 4.1: Target Creator (Guided)

Build a flexible target creation function.

Exercise
Solution 4.1
def create_target(df: pd.DataFrame, target_type: str = 'direction', 
                  horizon: int = 1, threshold: float = 0.0) -> pd.Series:
    """
    Create different types of prediction targets.
    """
    # Calculate future return
    future_return = df['Close'].pct_change(horizon).shift(-horizon)

    if target_type == 'direction':
        # Binary direction (1 if up, 0 if down)
        target = (future_return > 0).astype(int)

    elif target_type == 'return':
        # Return the actual return value
        target = future_return

    elif target_type == 'threshold':
        # Only label significant moves
        target = pd.Series(0, index=df.index)
        target[future_return > threshold] = 1
        target[future_return < -threshold] = -1

    else:
        raise ValueError(f"Unknown target_type: {target_type}")

    return target

4.2 The Triple Barrier Method

A more realistic labeling approach that mirrors actual trading exits.

# Triple Barrier Method explained

print("Triple Barrier Method")
print("="*50)

print("""
The triple barrier method labels based on which exit occurs first:

         ┌─────────────────────────────────────┐
         │ Take Profit Barrier (upper)         │  → Label = +1
         │ ================================    │
         │                                     │
    Entry│        Price Path                   │
    Point│           /\    /\                  │
         │          /  \  /  \                 │
         │         /    \/    \                │
         │ ================================    │
         │ Stop Loss Barrier (lower)           │  → Label = -1
         └─────────────────────────────────────┘

                                          Time Barrier → Label based on return

Three possible outcomes:
1. Price hits UPPER barrier first → Label = +1 (profitable)
2. Price hits LOWER barrier first → Label = -1 (loss)
3. TIME expires first → Label based on final return (+1, 0, or -1)

Benefits:
- Realistic: Mirrors stop-loss and take-profit orders
- Balanced: Can adjust barriers for class balance
- Actionable: Labels correspond to trading decisions
""")
# Implement triple barrier method

def triple_barrier_labels(
    df: pd.DataFrame,
    take_profit: float = 0.02,
    stop_loss: float = 0.02,
    max_holding: int = 10
) -> pd.DataFrame:
    """
    Apply triple barrier method for labeling.
    
    Args:
        df: DataFrame with 'Close' column
        take_profit: Upper barrier (e.g., 0.02 = 2%)
        stop_loss: Lower barrier (e.g., 0.02 = 2%)
        max_holding: Maximum holding period in days
        
    Returns:
        DataFrame with labels and exit info
    """
    labels = []
    
    for i in range(len(df) - max_holding):
        entry_price = df['Close'].iloc[i]
        entry_date = df.index[i]
        
        # Calculate barriers
        upper_barrier = entry_price * (1 + take_profit)
        lower_barrier = entry_price * (1 - stop_loss)
        
        # Look for barrier touches
        for j in range(1, max_holding + 1):
            if i + j >= len(df):
                break
                
            high = df['High'].iloc[i + j]
            low = df['Low'].iloc[i + j]
            close = df['Close'].iloc[i + j]
            
            # Check upper barrier (take profit)
            if high >= upper_barrier:
                labels.append({
                    'entry_date': entry_date,
                    'exit_date': df.index[i + j],
                    'holding_period': j,
                    'exit_type': 'take_profit',
                    'label': 1
                })
                break
                
            # Check lower barrier (stop loss)
            if low <= lower_barrier:
                labels.append({
                    'entry_date': entry_date,
                    'exit_date': df.index[i + j],
                    'holding_period': j,
                    'exit_type': 'stop_loss',
                    'label': -1
                })
                break
                
            # Check time barrier
            if j == max_holding:
                final_return = (close - entry_price) / entry_price
                label = 1 if final_return > 0 else (-1 if final_return < 0 else 0)
                labels.append({
                    'entry_date': entry_date,
                    'exit_date': df.index[i + j],
                    'holding_period': j,
                    'exit_type': 'time_barrier',
                    'label': label
                })
    
    return pd.DataFrame(labels)

# Apply triple barrier
labels_df = triple_barrier_labels(df, take_profit=0.02, stop_loss=0.02, max_holding=10)

print("Triple Barrier Results:")
print("="*50)
print(f"Total labeled samples: {len(labels_df)}")
print(f"\nExit type distribution:")
print(labels_df['exit_type'].value_counts())
print(f"\nLabel distribution:")
print(labels_df['label'].value_counts())
print(f"\nAverage holding period: {labels_df['holding_period'].mean():.1f} days")
# Visualize triple barrier on a sample

def visualize_triple_barrier(df: pd.DataFrame, start_idx: int, 
                             take_profit: float = 0.02, stop_loss: float = 0.02,
                             max_holding: int = 10):
    """
    Visualize triple barrier for a single trade.
    """
    entry_price = df['Close'].iloc[start_idx]
    upper = entry_price * (1 + take_profit)
    lower = entry_price * (1 - stop_loss)
    
    # Get price path
    end_idx = min(start_idx + max_holding, len(df) - 1)
    prices = df['Close'].iloc[start_idx:end_idx + 1]
    
    fig, ax = plt.subplots(figsize=(12, 5))
    
    # Plot price
    ax.plot(range(len(prices)), prices, 'b-', linewidth=2, label='Price')
    
    # Plot barriers
    ax.axhline(upper, color='green', linestyle='--', label=f'Take Profit ({take_profit:.1%})')
    ax.axhline(lower, color='red', linestyle='--', label=f'Stop Loss ({stop_loss:.1%})')
    ax.axhline(entry_price, color='gray', linestyle=':', alpha=0.5, label='Entry')
    
    # Mark entry
    ax.scatter([0], [entry_price], color='blue', s=100, zorder=5, marker='o')
    
    ax.set_xlabel('Days')
    ax.set_ylabel('Price')
    ax.set_title('Triple Barrier Example')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# Visualize an example
visualize_triple_barrier(df, start_idx=100, take_profit=0.02, stop_loss=0.02)

Exercise 4.2: Triple Barrier Labeler (Guided)

Create a simplified triple barrier labeling function.

Exercise
Solution 4.2
def simple_triple_barrier(df: pd.DataFrame, profit_target: float = 0.02, 
                         stop_loss: float = 0.02, max_days: int = 5) -> pd.Series:
    """
    Simplified triple barrier that returns just the labels.
    """
    labels = pd.Series(index=df.index, dtype=float)

    for i in range(len(df) - max_days):
        entry = df['Close'].iloc[i]

        # Calculate barrier levels
        upper = entry * (1 + profit_target)
        lower = entry * (1 - stop_loss)

        label = 0  # Default: time barrier

        for j in range(1, max_days + 1):
            # Check if upper barrier hit
            if df['High'].iloc[i + j] >= upper:
                label = 1
                break

            # Check if lower barrier hit
            if df['Low'].iloc[i + j] <= lower:
                label = -1
                break

        labels.iloc[i] = label

    return labels

4.3 Meta-Labeling

Meta-labeling uses ML to filter another model's signals.

# Meta-labeling concept

print("Meta-Labeling")
print("="*50)

print("""
Meta-labeling is a two-model approach:

┌──────────────────────────────────────────────────────────────┐
│ STEP 1: Primary Model (e.g., trend-following strategy)       │
│         Generates buy/sell signals                           │
│         Example: Buy when price crosses above SMA            │
└───────────────────────────┬──────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────────┐
│ STEP 2: Meta-Model (ML classifier)                          │
│         Filters primary signals: "Should I take this trade?"│
│         Target: Was the primary signal profitable?          │
└───────────────────────────┬──────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────────┐
│ FINAL: Only take trades where both models agree             │
│        Primary says "buy" AND Meta says "likely profitable" │
└──────────────────────────────────────────────────────────────┘

Benefits:
1. Separates signal generation from signal filtering
2. Maintains interpretable primary model
3. Meta-model can consider additional features
4. Often improves precision at cost of recall
""")
# Implement meta-labeling

def create_meta_labels(
    df: pd.DataFrame,
    primary_signal: pd.Series,
    holding_period: int = 5
) -> pd.Series:
    """
    Create meta-labels for primary model signals.
    
    Args:
        df: DataFrame with price data
        primary_signal: Series with primary model signals (1 for long, -1 for short)
        holding_period: Days to hold position
        
    Returns:
        Series with meta-labels (1 if signal was profitable, 0 if not)
    """
    meta_labels = pd.Series(index=df.index, dtype=float)
    
    # Calculate forward returns
    forward_return = df['Close'].pct_change(holding_period).shift(-holding_period)
    
    # Only label when primary signal exists
    signal_idx = primary_signal[primary_signal != 0].index
    
    for idx in signal_idx:
        if idx in forward_return.index and not pd.isna(forward_return.loc[idx]):
            signal = primary_signal.loc[idx]
            ret = forward_return.loc[idx]
            
            # Meta-label: Was the signal profitable?
            # Long signal (1) is profitable if return > 0
            # Short signal (-1) is profitable if return < 0
            profitable = (signal * ret) > 0
            meta_labels.loc[idx] = 1 if profitable else 0
    
    return meta_labels

# Create a simple primary model (MA crossover)
df['sma_20'] = df['Close'].rolling(20).mean()
df['sma_50'] = df['Close'].rolling(50).mean()

# Primary signal: 1 when short MA above long MA, else -1
df['primary_signal'] = np.where(df['sma_20'] > df['sma_50'], 1, -1)

# Generate meta-labels
meta_labels = create_meta_labels(df, df['primary_signal'], holding_period=10)

print("Meta-Label Distribution:")
print(meta_labels.dropna().value_counts())
print(f"\nWin rate of primary signals: {meta_labels.dropna().mean():.1%}")
# Complete meta-labeling example

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

def meta_labeling_example(df: pd.DataFrame):
    """
    Complete meta-labeling workflow example.
    """
    # Step 1: Create features
    features = pd.DataFrame(index=df.index)
    features['returns_5d'] = df['Close'].pct_change(5)
    features['volatility'] = df['Close'].pct_change().rolling(20).std()
    # 14-day RSI from average gains and losses
    delta = df['Close'].diff()
    gain = delta.where(delta > 0, 0).rolling(14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
    features['rsi'] = 100 - (100 / (1 + gain / loss))
    features['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
    
    # Step 2: Create primary signal
    sma_20 = df['Close'].rolling(20).mean()
    sma_50 = df['Close'].rolling(50).mean()
    primary_signal = pd.Series(
        np.where(sma_20 > sma_50, 1, -1),
        index=df.index
    )
    
    # Step 3: Create meta-labels
    meta_labels = create_meta_labels(df, primary_signal, holding_period=10)
    
    # Step 4: Prepare data for meta-model
    feature_cols = ['returns_5d', 'volatility', 'rsi', 'volume_ratio']
    df_ml = pd.concat([features[feature_cols], meta_labels.rename('meta_label')], axis=1)
    # dropna() keeps only rows with complete features AND a labeled signal,
    # so no separate notna() filter on the meta-label is needed
    df_ml = df_ml.dropna()
    
    X = df_ml[feature_cols]
    y = df_ml['meta_label'].astype(int)
    
    # Step 5: Train meta-model (time series split)
    split_idx = int(len(X) * 0.8)
    X_train, X_test = X[:split_idx], X[split_idx:]
    y_train, y_test = y[:split_idx], y[split_idx:]
    
    meta_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
    meta_model.fit(X_train, y_train)
    
    # Step 6: Evaluate
    y_pred = meta_model.predict(X_test)
    
    return {
        'primary_win_rate': y.mean(),
        'meta_accuracy': accuracy_score(y_test, y_pred),
        'test_predictions': pd.Series(y_pred, index=y_test.index),
        'feature_importance': dict(zip(feature_cols, meta_model.feature_importances_))
    }

# Run meta-labeling example
results = meta_labeling_example(df)

print("Meta-Labeling Results:")
print("="*50)
print(f"Primary Model Win Rate: {results['primary_win_rate']:.1%}")
print(f"Meta-Model Accuracy: {results['meta_accuracy']:.1%}")
print(f"\nFeature Importance:")
for feat, imp in sorted(results['feature_importance'].items(), key=lambda x: -x[1]):
    print(f"  {feat}: {imp:.3f}")

Exercise 4.3: Meta-Label Generator (Guided)

Create a function that generates meta-labels for any primary signal.

Exercise
Solution 4.3
def generate_meta_labels(df: pd.DataFrame, signal_col: str,
                         profit_threshold: float = 0.01,
                         holding_days: int = 5) -> pd.Series:
    """
    Generate meta-labels for a given signal column.
    """
    # Calculate forward return
    forward_return = df['Close'].pct_change(holding_days).shift(-holding_days)

    # Get signal values
    signal = df[signal_col]

    # Calculate actual profit (signal * return)
    actual_profit = signal * forward_return

    # Create meta-label (1 if profit > threshold, else 0)
    meta_label = (actual_profit > profit_threshold).astype(int)

    # Only keep labels where signal was non-zero
    meta_label = meta_label.where(signal != 0)

    return meta_label

4.4 Avoiding Lookahead Bias

The most common and deadly mistake in financial ML.

# Lookahead bias examples

print("Common Lookahead Bias Mistakes")
print("="*50)

mistakes = [
    {
        'name': 'Using Same-Day Close in Features',
        'wrong': 'feature = (close - sma) / sma  # close includes today',
        'right': 'feature = (close.shift(1) - sma.shift(1)) / sma.shift(1)',
        'explanation': 'Features should only use information available at decision time'
    },
    {
        'name': 'Scaling with Full Dataset Statistics',
        'wrong': 'scaler.fit(X)  # Uses future data statistics',
        'right': 'scaler.fit(X_train)  # Only use training data',
        'explanation': 'Statistics (mean, std) must come only from past data'
    },
    {
        'name': 'Feature Selection Using All Data',
        'wrong': 'Select features using correlation with target on all data',
        'right': 'Select features only on training data',
        'explanation': 'Feature selection is part of model fitting'
    },
    {
        'name': 'Using Adjusted Prices',
        'wrong': 'Split-adjusted prices for historical signals',
        'right': 'Use unadjusted prices, adjust at point in time',
        'explanation': 'Price adjustments are applied retroactively'
    }
]

for i, mistake in enumerate(mistakes, 1):
    print(f"\n{i}. {mistake['name']}")
    print(f"   WRONG: {mistake['wrong']}")
    print(f"   RIGHT: {mistake['right']}")
    print(f"   Why: {mistake['explanation']}")
# Lookahead bias checker

def check_lookahead_bias(df: pd.DataFrame, feature_cols: list, target_col: str) -> dict:
    """
    Check for potential lookahead bias in features.
    
    Args:
        df: DataFrame with features and target
        feature_cols: List of feature columns
        target_col: Name of target column
        
    Returns:
        Dictionary with warnings and analysis
    """
    warnings = []
    analysis = {}
    
    # Check correlation between features and target
    # Extremely high correlation might indicate lookahead
    for col in feature_cols:
        if col in df.columns and target_col in df.columns:
            corr = df[col].corr(df[target_col])
            if abs(corr) > 0.5:
                warnings.append(
                    f"High correlation ({corr:.2f}) between '{col}' and target - "
                    f"possible lookahead bias"
                )
            analysis[col] = {'correlation': corr}
    
    # Check for suspicious column names
    suspicious_keywords = ['future', 'forward', 'next', 'tomorrow']
    for col in feature_cols:
        for keyword in suspicious_keywords:
            if keyword in col.lower():
                warnings.append(
                    f"Column '{col}' contains suspicious keyword '{keyword}'"
                )
    
    return {
        'warnings': warnings,
        'analysis': analysis,
        'has_potential_issues': len(warnings) > 0
    }

# Create some features with potential issues
test_df = df.copy()
test_df['future_return'] = test_df['Close'].pct_change().shift(-1)  # LOOKAHEAD!
test_df['past_return'] = test_df['Close'].pct_change()  # OK
test_df['target'] = (test_df['Close'].pct_change().shift(-1) > 0).astype(int)

# Check for bias
result = check_lookahead_bias(
    test_df.dropna(), 
    ['future_return', 'past_return'], 
    'target'
)

print("Lookahead Bias Check:")
print("="*50)
print(f"Potential issues found: {result['has_potential_issues']}")
if result['warnings']:
    print("\nWarnings:")
    for warning in result['warnings']:
        print(f"  - {warning}")
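To see how damaging lookahead bias is in practice, train the same classifier once on a leaked feature (tomorrow's return, which is the target itself) and once on an honest one (yesterday's return). This standalone sketch uses a synthetic random walk, so the honest model should land near chance while the leaked model looks almost perfect:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(7)

# Synthetic random walk: tomorrow is unpredictable from the past by construction
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 1000))))
returns = close.pct_change()

data = pd.DataFrame({
    'past_return': returns,              # available at decision time
    'future_return': returns.shift(-1),  # LOOKAHEAD: this is the target itself
    'target': (returns.shift(-1) > 0).astype(int),
}).dropna()

# Time-ordered split, as in the rest of the module
split = int(len(data) * 0.8)
train, test = data.iloc[:split], data.iloc[split:]

accs = {}
for feature in ['past_return', 'future_return']:
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(train[[feature]], train['target'])
    accs[feature] = accuracy_score(test['target'], model.predict(test[[feature]]))
    print(f"{feature}: test accuracy {accs[feature]:.1%}")
```

With the honest feature the model scores near 50% on this series; with the leaked one it is nearly perfect. A gap that size in a real project is a strong sign of leakage, not skill.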

Open-Ended Exercises

Exercise 4.4: Adaptive Barrier Labels (Open-ended)

Create triple barrier labels with adaptive barriers based on volatility.

Exercise
Solution 4.4
def adaptive_triple_barrier(df: pd.DataFrame, vol_multiplier: float = 2.0,
                           vol_window: int = 20, max_days: int = 10) -> pd.DataFrame:
    """
    Triple barrier with volatility-adaptive barriers.

    Args:
        df: DataFrame with OHLCV data
        vol_multiplier: Multiplier for volatility to set barriers
        vol_window: Window for volatility calculation
        max_days: Maximum holding period

    Returns:
        DataFrame with labels and barrier info
    """
    # Calculate daily volatility
    returns = df['Close'].pct_change()
    volatility = returns.rolling(vol_window).std()

    results = []

    for i in range(vol_window, len(df) - max_days):
        entry_price = df['Close'].iloc[i]
        entry_date = df.index[i]

        # Adaptive barrier width based on current volatility
        current_vol = volatility.iloc[i]
        barrier_width = current_vol * vol_multiplier

        upper = entry_price * (1 + barrier_width)
        lower = entry_price * (1 - barrier_width)

        label = 0
        exit_type = 'time'
        exit_day = max_days

        for j in range(1, max_days + 1):
            if df['High'].iloc[i + j] >= upper:
                label = 1
                exit_type = 'profit'
                exit_day = j
                break
            if df['Low'].iloc[i + j] <= lower:
                label = -1
                exit_type = 'stop'
                exit_day = j
                break

        results.append({
            'date': entry_date,
            'entry_price': entry_price,
            'volatility': current_vol,
            'barrier_width': barrier_width,
            'upper_barrier': upper,
            'lower_barrier': lower,
            'label': label,
            'exit_type': exit_type,
            'exit_day': exit_day
        })

    return pd.DataFrame(results)

# Test
adaptive_labels = adaptive_triple_barrier(df, vol_multiplier=2.0, vol_window=20, max_days=10)

print("Adaptive Triple Barrier Results:")
print(f"Label distribution:\n{adaptive_labels['label'].value_counts()}")
print(f"\nAverage barrier width: {adaptive_labels['barrier_width'].mean():.2%}")
print(f"Barrier width range: {adaptive_labels['barrier_width'].min():.2%} to {adaptive_labels['barrier_width'].max():.2%}")

Exercise 4.5: Label Quality Analyzer (Open-ended)

Build a comprehensive label quality analysis tool.

Exercise
Solution 4.5
class LabelAnalyzer:
    """
    Analyze label quality for financial ML.
    """

    def __init__(self, labels: pd.Series, holding_period: int = 1):
        self.labels = labels.dropna()
        self.holding_period = holding_period

    def class_distribution(self) -> dict:
        """Analyze class distribution."""
        counts = self.labels.value_counts()
        pcts = self.labels.value_counts(normalize=True) * 100

        return {
            'counts': counts.to_dict(),
            'percentages': pcts.round(2).to_dict(),
            'imbalance_ratio': counts.max() / counts.min()
        }

    def label_overlap(self) -> dict:
        """Check for overlapping labels."""
        # Labels overlap if they're within holding_period of each other
        label_dates = self.labels.index

        overlap_count = 0
        for i, date in enumerate(label_dates[:-1]):
            next_date = label_dates[i + 1]
            if (next_date - date).days < self.holding_period:
                overlap_count += 1

        return {
            'total_labels': len(self.labels),
            'overlapping': overlap_count,
            'overlap_percentage': overlap_count / len(self.labels) * 100
        }

    def label_uniqueness(self) -> pd.Series:
        """
        Calculate uniqueness score for each label.

        Labels are less unique if they overlap with many others.
        """
        uniqueness = pd.Series(1.0, index=self.labels.index)

        for i, date in enumerate(self.labels.index):
            # Count overlapping labels
            start = date - pd.Timedelta(days=self.holding_period)
            end = date + pd.Timedelta(days=self.holding_period)

            overlapping = self.labels[(self.labels.index >= start) & 
                                      (self.labels.index <= end) &
                                      (self.labels.index != date)]

            if len(overlapping) > 0:
                uniqueness.loc[date] = 1 / (1 + len(overlapping))

        return uniqueness

    def sample_weights(self) -> pd.Series:
        """Generate sample weights based on uniqueness."""
        uniqueness = self.label_uniqueness()
        # Normalize to sum to number of samples
        weights = uniqueness / uniqueness.sum() * len(uniqueness)
        return weights

    def get_report(self) -> str:
        """Generate full analysis report."""
        dist = self.class_distribution()
        overlap = self.label_overlap()
        uniqueness = self.label_uniqueness()

        report = "Label Quality Report\n" + "="*50 + "\n"

        report += "\nClass Distribution:\n"
        for cls, count in dist['counts'].items():
            pct = dist['percentages'][cls]
            report += f"  Class {cls}: {count} ({pct}%)\n"
        report += f"  Imbalance Ratio: {dist['imbalance_ratio']:.2f}\n"

        report += f"\nLabel Overlap:\n"
        report += f"  Overlapping labels: {overlap['overlapping']} ({overlap['overlap_percentage']:.1f}%)\n"

        report += f"\nLabel Uniqueness:\n"
        report += f"  Mean uniqueness: {uniqueness.mean():.3f}\n"
        report += f"  Min uniqueness: {uniqueness.min():.3f}\n"

        return report

# Test
test_labels = df['target_5d'].dropna()
analyzer = LabelAnalyzer(test_labels, holding_period=5)

print(analyzer.get_report())

Exercise 4.6: Complete Labeling System (Open-ended)

Build a production-ready labeling system.

Exercise
Solution 4.6
class LabelingSystem:
    """
    Production-ready labeling system for financial ML.
    """

    METHODS = ['direction', 'triple_barrier', 'threshold']

    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        self.labels = None
        self.sample_weights = None
        self.config = {}

    def create_labels(self, method: str = 'direction', **kwargs) -> 'LabelingSystem':
        """
        Create labels using specified method.

        Args:
            method: Labeling method
            **kwargs: Method-specific parameters
        """
        self.config = {'method': method, **kwargs}

        if method == 'direction':
            horizon = kwargs.get('horizon', 1)
            future_return = self.df['Close'].pct_change(horizon).shift(-horizon)
            self.labels = (future_return > 0).astype(int)

        elif method == 'triple_barrier':
            tp = kwargs.get('take_profit', 0.02)
            sl = kwargs.get('stop_loss', 0.02)
            max_hold = kwargs.get('max_holding', 10)

            self.labels = pd.Series(index=self.df.index, dtype=float)

            for i in range(len(self.df) - max_hold):
                entry = self.df['Close'].iloc[i]
                upper = entry * (1 + tp)
                lower = entry * (1 - sl)
                label = 0

                for j in range(1, max_hold + 1):
                    if self.df['High'].iloc[i + j] >= upper:
                        label = 1
                        break
                    if self.df['Low'].iloc[i + j] <= lower:
                        label = -1
                        break

                self.labels.iloc[i] = label

        elif method == 'threshold':
            horizon = kwargs.get('horizon', 5)
            threshold = kwargs.get('threshold', 0.02)
            future_return = self.df['Close'].pct_change(horizon).shift(-horizon)

            self.labels = pd.Series(0, index=self.df.index)
            self.labels[future_return > threshold] = 1
            self.labels[future_return < -threshold] = -1

        else:
            raise ValueError(f"Unknown method: {method}. Use one of {self.METHODS}")

        return self

    def check_bias(self, feature_df: pd.DataFrame) -> dict:
        """Check for potential lookahead bias."""
        warnings = []

        for col in feature_df.columns:
            # Check suspicious names
            for keyword in ['future', 'forward', 'next', 'tomorrow']:
                if keyword in col.lower():
                    warnings.append(f"Suspicious column name: {col}")

            # Check high correlation
            if self.labels is not None:
                corr = feature_df[col].corr(self.labels)
                if abs(corr) > 0.5:
                    warnings.append(f"High correlation ({corr:.2f}): {col}")

        return {
            'warnings': warnings,
            'has_issues': len(warnings) > 0
        }

    def compute_sample_weights(self, holding_period: int = None) -> pd.Series:
        """Compute sample weights based on label uniqueness."""
        if self.labels is None:
            raise ValueError("Create labels first")

        if holding_period is None:
            holding_period = self.config.get(
                'horizon', self.config.get('max_holding', 1))

        weights = pd.Series(1.0, index=self.labels.dropna().index)

        for i, date in enumerate(weights.index):
            start = date - pd.Timedelta(days=holding_period)
            end = date + pd.Timedelta(days=holding_period)

            concurrent = weights[(weights.index >= start) & 
                                (weights.index <= end)]
            weights.loc[date] = 1 / len(concurrent)

        self.sample_weights = weights / weights.sum() * len(weights)
        return self.sample_weights

    def get_labels(self, dropna: bool = True) -> pd.Series:
        """Get the labels."""
        if self.labels is None:
            raise ValueError("Create labels first")
        return self.labels.dropna() if dropna else self.labels

    def get_summary(self) -> str:
        """Get labeling summary."""
        if self.labels is None:
            return "No labels created yet"

        labels = self.labels.dropna()

        summary = "Labeling System Summary\n" + "="*50 + "\n"
        summary += f"Method: {self.config.get('method', 'unknown')}\n"
        summary += f"Config: {self.config}\n"
        summary += f"\nTotal labels: {len(labels)}\n"
        summary += f"Label distribution:\n{labels.value_counts()}\n"

        return summary

# Test
labeler = LabelingSystem(df)
labeler.create_labels('triple_barrier', take_profit=0.02, stop_loss=0.02, max_holding=5)

print(labeler.get_summary())

# Compute weights
weights = labeler.compute_sample_weights(holding_period=5)
print(f"\nSample weights computed: {len(weights)} samples")

Module Project: Target Labeling System

Build a comprehensive target engineering system.

# Module Project: Target Labeling System

import pandas as pd
import numpy as np
from typing import Dict, List, Optional, Union


class TargetEngineer:
    """
    Complete target engineering system for financial ML.
    
    Supports multiple labeling methods and provides quality analysis.
    """
    
    def __init__(self, df: pd.DataFrame):
        """
        Initialize with price data.
        
        Args:
            df: DataFrame with OHLCV columns
        """
        self.df = df.copy()
        self.targets = {}
        self.quality_metrics = {}
    
    def create_direction_target(self, name: str, horizon: int = 1) -> pd.Series:
        """
        Create binary direction target.
        
        Args:
            name: Name for this target
            horizon: Days ahead to predict
            
        Returns:
            Series with binary labels
        """
        future_return = self.df['Close'].pct_change(horizon).shift(-horizon)
        target = (future_return > 0).astype(float)
        # Mask the final rows, where the future return is not yet observable
        target[future_return.isna()] = np.nan
        self.targets[name] = target
        return target
    
    def create_triple_barrier_target(
        self, 
        name: str,
        profit_target: float = 0.02,
        stop_loss: float = 0.02,
        max_holding: int = 10
    ) -> pd.Series:
        """
        Create triple barrier target.
        
        Args:
            name: Name for this target
            profit_target: Take profit level
            stop_loss: Stop loss level
            max_holding: Maximum holding period
            
        Returns:
            Series with labels (-1, 0, 1)
        """
        labels = pd.Series(index=self.df.index, dtype=float)
        
        for i in range(len(self.df) - max_holding):
            entry = self.df['Close'].iloc[i]
            upper = entry * (1 + profit_target)
            lower = entry * (1 - stop_loss)
            
            label = 0
            for j in range(1, max_holding + 1):
                if self.df['High'].iloc[i + j] >= upper:
                    label = 1
                    break
                if self.df['Low'].iloc[i + j] <= lower:
                    label = -1
                    break
            
            labels.iloc[i] = label
        
        self.targets[name] = labels
        return labels
    
    def create_threshold_target(
        self,
        name: str,
        horizon: int = 5,
        threshold: float = 0.02
    ) -> pd.Series:
        """
        Create threshold-based target.
        
        Only labels significant moves.
        
        Args:
            name: Name for this target
            horizon: Days ahead
            threshold: Minimum move to label
            
        Returns:
            Series with labels (-1, 0, 1)
        """
        future_return = self.df['Close'].pct_change(horizon).shift(-horizon)
        
        target = pd.Series(0.0, index=self.df.index)
        target[future_return > threshold] = 1
        target[future_return < -threshold] = -1
        # Mask the final rows, where the future return is not yet observable
        target[future_return.isna()] = np.nan
        
        self.targets[name] = target
        return target
    
    def create_meta_target(
        self,
        name: str,
        primary_signal: pd.Series,
        holding_period: int = 5
    ) -> pd.Series:
        """
        Create meta-labels for a primary signal.
        
        Args:
            name: Name for this target
            primary_signal: Primary model signals (1, -1, 0)
            holding_period: Period to evaluate profit
            
        Returns:
            Series with meta-labels (1 = profitable, 0 = not)
        """
        forward_return = self.df['Close'].pct_change(holding_period).shift(-holding_period)
        profit = primary_signal * forward_return
        
        meta_labels = (profit > 0).astype(float)
        # Only meaningful where the primary model took a position and the
        # forward return is observable
        meta_labels = meta_labels.where((primary_signal != 0) & forward_return.notna())
        
        self.targets[name] = meta_labels
        return meta_labels
    
    def analyze_target(self, name: str) -> Dict:
        """
        Analyze quality of a target.
        
        Args:
            name: Target name to analyze
            
        Returns:
            Dictionary with quality metrics
        """
        if name not in self.targets:
            raise ValueError(f"Target '{name}' not found")
        
        target = self.targets[name].dropna()
        
        metrics = {
            'total_samples': len(target),
            'class_distribution': target.value_counts().to_dict(),
            'class_percentages': (target.value_counts(normalize=True) * 100).round(2).to_dict(),
            'unique_values': target.nunique(),
            'date_range': (str(target.index[0].date()), str(target.index[-1].date()))
        }
        
        # Calculate imbalance
        counts = target.value_counts()
        metrics['imbalance_ratio'] = counts.max() / counts.min()
        
        self.quality_metrics[name] = metrics
        return metrics
    
    def compute_sample_weights(self, name: str, holding_period: int = 1) -> pd.Series:
        """
        Compute sample weights based on uniqueness.
        
        Args:
            name: Target name
            holding_period: Period for overlap calculation
            
        Returns:
            Series with sample weights
        """
        target = self.targets[name].dropna()
        weights = pd.Series(1.0, index=target.index)
        
        for date in target.index:
            start = date - pd.Timedelta(days=holding_period)
            end = date + pd.Timedelta(days=holding_period)
            concurrent = target[(target.index >= start) & (target.index <= end)]
            weights.loc[date] = 1 / len(concurrent)
        
        # Normalize
        weights = weights / weights.sum() * len(weights)
        return weights
    
    def get_target(self, name: str, dropna: bool = True) -> pd.Series:
        """
        Get a target by name.
        
        Args:
            name: Target name
            dropna: Whether to drop NaN values
            
        Returns:
            Target Series
        """
        if name not in self.targets:
            raise ValueError(f"Target '{name}' not found")
        
        return self.targets[name].dropna() if dropna else self.targets[name]
    
    def list_targets(self) -> List[str]:
        """List all created targets."""
        return list(self.targets.keys())
    
    def get_summary(self) -> str:
        """
        Get summary of all targets.
        
        Returns:
            Formatted summary string
        """
        summary = ["Target Engineering Summary", "="*50]
        
        if not self.targets:
            summary.append("No targets created yet.")
            return "\n".join(summary)
        
        for name in self.targets:
            metrics = self.analyze_target(name)
            summary.append(f"\n{name}:")
            summary.append(f"  Samples: {metrics['total_samples']}")
            summary.append(f"  Classes: {metrics['class_distribution']}")
            summary.append(f"  Imbalance: {metrics['imbalance_ratio']:.2f}")
        
        return "\n".join(summary)


# Demo the target engineering system
print("Target Engineering System Demo")
print("="*60)

# Initialize
engineer = TargetEngineer(df)

# Create different targets
engineer.create_direction_target('direction_1d', horizon=1)
engineer.create_direction_target('direction_5d', horizon=5)
engineer.create_triple_barrier_target('triple_barrier', profit_target=0.02, stop_loss=0.02)
engineer.create_threshold_target('threshold', horizon=5, threshold=0.02)

# Create meta-label
primary_signal = pd.Series(
    np.where(df['Close'] > df['Close'].rolling(20).mean(), 1, -1),
    index=df.index
)
engineer.create_meta_target('meta', primary_signal, holding_period=5)

# Print summary
print(engineer.get_summary())

# Get sample weights
weights = engineer.compute_sample_weights('triple_barrier', holding_period=5)
print("\nSample weights computed for triple_barrier")
print(f"  Mean weight: {weights.mean():.3f}")
print(f"  Weight range: {weights.min():.3f} - {weights.max():.3f}")

Key Takeaways

  1. Target Choice Matters: The prediction target determines what the model learns; choose carefully

  2. Triple Barrier Method: More realistic than simple direction labels; mirrors actual trading exits

  3. Meta-Labeling: Separates signal generation from signal filtering; improves precision

  4. Lookahead Bias: The #1 killer of backtests; always verify features don't use future information

  5. Sample Weights: Account for overlapping labels to prevent overweighting similar samples

  6. Class Balance: Trading labels are often imbalanced; monitor and address this
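Takeaway 4 can be demonstrated concretely. The sketch below (synthetic data; all variable names are illustrative, not from the course code) trains the same model twice: once on a feature leaked from the future and once on an honest lagged feature. The leaked feature produces near-perfect test accuracy on pure noise, which is exactly the signature to look for in a suspiciously good backtest.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic daily returns and a binary next-day direction target
returns = pd.Series(rng.normal(0, 0.01, 1000))
target = (returns.shift(-1) > 0).astype(int)

# Leaked feature: tomorrow's return itself (lookahead bias)
# Honest feature: today's return (known at prediction time)
data = pd.DataFrame({
    'leaked': returns.shift(-1),
    'honest': returns,
    'y': target
}).dropna()
split = int(len(data) * 0.8)

def oos_accuracy(cols):
    """Train on the first 80%, score on the last 20%."""
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(data[cols].iloc[:split], data['y'].iloc[:split])
    return model.score(data[cols].iloc[split:], data['y'].iloc[split:])

acc_leaked = oos_accuracy(['leaked'])
acc_honest = oos_accuracy(['honest'])
print(f"Accuracy with leaked feature: {acc_leaked:.2%}")  # near-perfect
print(f"Accuracy with honest feature: {acc_honest:.2%}")  # near coin flip
```

The returns are pure noise, so any accuracy far above 50% here can only come from leakage.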


Next: Module 5 - Tree-Based Models

Learn how to build powerful prediction models using decision trees, random forests, and gradient boosting.

Module 5: Tree-Based Models

Part 2: Classification Models

Duration Exercises Prerequisites
~2.5 hours 6 Modules 1-4

Learning Objectives

By the end of this module, you will be able to:

- Understand decision tree fundamentals and splitting criteria
- Build and tune Random Forest classifiers for trading signals
- Apply XGBoost and LightGBM for enhanced performance
- Interpret models using feature importance
- Apply ensemble methods for robust predictions

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Gradient boosting libraries
try:
    import xgboost as xgb
    HAS_XGBOOST = True
except ImportError:
    HAS_XGBOOST = False
    print("XGBoost not installed. Install with: pip install xgboost")

try:
    import lightgbm as lgb
    HAS_LIGHTGBM = True
except ImportError:
    HAS_LIGHTGBM = False
    print("LightGBM not installed. Install with: pip install lightgbm")

import yfinance as yf

print("Module 5: Tree-Based Models")
print("=" * 40)

Section 1: Decision Tree Fundamentals

Decision trees are the foundation of many powerful ensemble methods. They make predictions by recursively partitioning the feature space.

# Decision Tree Concepts

tree_concepts = """
DECISION TREE STRUCTURE
=======================

                    [Root Node]
                   RSI > 70?
                  /         \\
                Yes          No
               /               \\
        [Internal]           [Internal]
       Vol > 0.02?          MACD > 0?
       /       \\             /       \\
    [Leaf]   [Leaf]      [Leaf]    [Leaf]
    SELL     HOLD        BUY       HOLD


KEY CONCEPTS:
-------------
1. Splitting Criteria
   - Gini Impurity: How often a randomly chosen element would be incorrectly labeled
   - Entropy: Measure of randomness/disorder in the data
   - Information Gain: Reduction in entropy after a split

2. Tree Parameters
   - max_depth: How deep the tree can grow
   - min_samples_split: Minimum samples to create a split
   - min_samples_leaf: Minimum samples required at leaf node

3. Advantages for Finance
   - Interpretable: Can explain why a prediction was made
   - Non-linear: Captures complex relationships
   - Feature importance: Ranks feature usefulness

4. Disadvantages
   - Prone to overfitting
   - High variance (small data changes → different tree)
   - Greedy algorithm (locally optimal, not globally)
"""
print(tree_concepts)
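The splitting criteria above are easy to compute by hand. The sketch below (illustrative label counts of my own choosing) calculates Gini impurity, entropy, and the information gain of a candidate split like "RSI > 70?":

```python
import numpy as np

def gini(labels):
    """Gini impurity: chance a random element is labeled incorrectly."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Parent node: 10 up days, 10 down days
parent = np.array([1] * 10 + [0] * 10)

# Candidate split: left branch mostly down, right branch mostly up
left = np.array([1] * 2 + [0] * 8)
right = np.array([1] * 8 + [0] * 2)

# Information gain = parent entropy - weighted average child entropy
weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted

print(f"Parent Gini:      {gini(parent):.3f}")    # 0.500 for a 50/50 node
print(f"Parent entropy:   {entropy(parent):.3f}")  # 1.000 bit
print(f"Information gain: {info_gain:.3f}")
```

A tree-growing algorithm evaluates this gain for every candidate split and greedily picks the largest one.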
# Prepare data for tree models

def prepare_trading_data(symbol: str = "SPY", period: str = "2y") -> Tuple[pd.DataFrame, pd.Series]:
    """Prepare data with features and target for classification."""
    
    # Fetch data
    ticker = yf.Ticker(symbol)
    df = ticker.history(period=period)
    
    # Create features
    df['returns'] = df['Close'].pct_change()
    df['volatility'] = df['returns'].rolling(20).std()
    df['momentum_5'] = df['Close'].pct_change(5)
    df['momentum_20'] = df['Close'].pct_change(20)
    
    # Moving averages
    df['sma_5'] = df['Close'].rolling(5).mean()
    df['sma_20'] = df['Close'].rolling(20).mean()
    df['sma_50'] = df['Close'].rolling(50).mean()
    
    # Distance from MAs
    df['dist_sma5'] = (df['Close'] - df['sma_5']) / df['sma_5']
    df['dist_sma20'] = (df['Close'] - df['sma_20']) / df['sma_20']
    df['dist_sma50'] = (df['Close'] - df['sma_50']) / df['sma_50']
    
    # RSI
    delta = df['Close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
    rs = gain / loss
    df['rsi'] = 100 - (100 / (1 + rs))
    
    # Volume features
    df['volume_ma'] = df['Volume'].rolling(20).mean()
    df['volume_ratio'] = df['Volume'] / df['volume_ma']
    
    # Target: next day direction
    df['target'] = (df['returns'].shift(-1) > 0).astype(int)
    
    # Clean
    df = df.dropna()
    
    features = ['volatility', 'momentum_5', 'momentum_20', 'dist_sma5', 
                'dist_sma20', 'dist_sma50', 'rsi', 'volume_ratio']
    
    X = df[features]
    y = df['target']
    
    return X, y

# Prepare data
X, y = prepare_trading_data()
print(f"Dataset: {len(X)} samples, {len(X.columns)} features")
print(f"Target distribution: {y.value_counts().to_dict()}")
# Build a simple decision tree

# Time series split
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

# Train decision tree with limited depth
dt_model = DecisionTreeClassifier(
    max_depth=3,  # Shallow tree to avoid overfitting
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=42
)

dt_model.fit(X_train, y_train)

# Evaluate
train_acc = dt_model.score(X_train, y_train)
test_acc = dt_model.score(X_test, y_test)

print(f"Decision Tree Results:")
print(f"  Train Accuracy: {train_acc:.2%}")
print(f"  Test Accuracy:  {test_acc:.2%}")
print(f"  Overfit Gap:    {train_acc - test_acc:.2%}")
# Visualize the decision tree

plt.figure(figsize=(20, 10))
plot_tree(
    dt_model,
    feature_names=X.columns.tolist(),
    class_names=['Down', 'Up'],
    filled=True,
    rounded=True,
    fontsize=10
)
plt.title('Decision Tree for Trading Signal Classification')
plt.tight_layout()
plt.show()
# Feature importance from decision tree

importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': dt_model.feature_importances_
}).sort_values('importance', ascending=True)

plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'], importance_df['importance'])
plt.xlabel('Feature Importance')
plt.title('Decision Tree Feature Importance')
plt.tight_layout()
plt.show()

print("\nFeature Importance Ranking:")
for _, row in importance_df.iloc[::-1].iterrows():
    print(f"  {row['feature']:15s}: {row['importance']:.4f}")

Section 2: Random Forest

Random Forest combines multiple decision trees using bagging and feature randomization to reduce overfitting and variance.

# Random Forest Concepts

rf_concepts = """
RANDOM FOREST
=============

Key Ideas:
----------
1. Bootstrap Aggregating (Bagging)
   - Train each tree on a random sample of data (with replacement)
   - Reduces variance without increasing bias

2. Feature Randomization
   - Each split considers only a random subset of features
   - Decorrelates trees, improving ensemble performance

3. Aggregation
   - Classification: Majority vote across all trees
   - Regression: Average of all tree predictions

Parameters:
-----------
- n_estimators: Number of trees (more is usually better, diminishing returns)
- max_features: Features to consider at each split ('sqrt', 'log2', or int)
- max_depth: Maximum tree depth (None = fully grown)
- min_samples_split: Minimum samples to make a split
- min_samples_leaf: Minimum samples at leaf nodes

Advantages:
-----------
+ Robust to overfitting (compared to single tree)
+ Handles missing values well
+ Provides feature importance
+ Out-of-bag (OOB) error estimate
+ Parallelizable

Disadvantages:
--------------
- Less interpretable than single tree
- Memory intensive (stores all trees)
- Slower prediction than single tree
"""
print(rf_concepts)
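The aggregation step is worth verifying directly: scikit-learn's `RandomForestClassifier` documents its `predict_proba` as the mean of the per-tree class probabilities (soft voting). A minimal sketch on synthetic data (variable names are illustrative) checks this and uses the per-tree spread as a rough confidence signal:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

forest = RandomForestClassifier(n_estimators=25, max_depth=5, random_state=42)
forest.fit(X_demo, y_demo)

# Stack each tree's class probabilities: shape (n_trees, n_samples, n_classes)
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
manual_proba = tree_probas.mean(axis=0)

# The forest's probability should match the per-tree average (up to float error)
max_diff = np.abs(manual_proba - forest.predict_proba(X_demo)).max()
print(f"Max difference, forest vs manual average: {max_diff:.2e}")

# Disagreement across trees: where trees differ most, confidence is lowest
disagreement = tree_probas[:, :, 1].std(axis=0)
print(f"Mean per-tree std of P(class 1): {disagreement.mean():.3f}")
```

Filtering trades to samples where tree disagreement is low is one simple way to trade only high-conviction signals.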
# Build a Random Forest classifier

rf_model = RandomForestClassifier(
    n_estimators=100,      # Number of trees
    max_depth=5,           # Limit depth to prevent overfitting
    min_samples_split=20,
    min_samples_leaf=10,
    max_features='sqrt',   # sqrt(n_features) at each split
    oob_score=True,        # Calculate out-of-bag score
    random_state=42,
    n_jobs=-1              # Use all cores
)

rf_model.fit(X_train, y_train)

# Evaluate
train_acc = rf_model.score(X_train, y_train)
test_acc = rf_model.score(X_test, y_test)
oob_acc = rf_model.oob_score_

print(f"Random Forest Results:")
print(f"  Train Accuracy:   {train_acc:.2%}")
print(f"  Test Accuracy:    {test_acc:.2%}")
print(f"  OOB Accuracy:     {oob_acc:.2%}")
print(f"  Overfit Gap:      {train_acc - test_acc:.2%}")
# Compare single tree vs Random Forest

print("\nComparison:")
print(f"{'Model':<20} {'Train Acc':<12} {'Test Acc':<12} {'Gap':<10}")
print("-" * 54)

dt_train = dt_model.score(X_train, y_train)
dt_test = dt_model.score(X_test, y_test)
print(f"{'Decision Tree':<20} {dt_train:<12.2%} {dt_test:<12.2%} {dt_train-dt_test:<10.2%}")

rf_train = rf_model.score(X_train, y_train)
rf_test = rf_model.score(X_test, y_test)
print(f"{'Random Forest':<20} {rf_train:<12.2%} {rf_test:<12.2%} {rf_train-rf_test:<10.2%}")
# Random Forest feature importance

rf_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=True)

plt.figure(figsize=(10, 6))
plt.barh(rf_importance['feature'], rf_importance['importance'], color='forestgreen')
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.show()
# Exercise 5.1: Random Forest Tuner (Guided)

def tune_random_forest(X_train: pd.DataFrame, y_train: pd.Series,
                       n_estimators_list: List[int] = [50, 100, 200],
                       max_depth_list: List[int] = [3, 5, 7]) -> Dict:
    """
    Tune Random Forest hyperparameters using time series cross-validation.
    
    Returns:
        Dictionary with best parameters and scores
    """
    # TODO: Create time series cross-validator with 5 splits
    tscv = ______(n_splits=______)
    
    best_score = -1
    best_params = {}
    results = []
    
    for n_est in n_estimators_list:
        for depth in max_depth_list:
            # TODO: Create Random Forest with current parameters
            model = ______(
                n_estimators=______,
                max_depth=______,
                min_samples_leaf=10,
                random_state=42,
                n_jobs=-1
            )
            
            # TODO: Get cross-validation scores
            scores = ______(model, X_train, y_train, cv=tscv, scoring='accuracy')
            mean_score = scores.______()
            
            results.append({
                'n_estimators': n_est,
                'max_depth': depth,
                'mean_cv_score': mean_score,
                'std_cv_score': scores.std()
            })
            
            if mean_score > best_score:
                best_score = mean_score
                best_params = {'n_estimators': n_est, 'max_depth': depth}
    
    return {
        'best_params': best_params,
        'best_score': best_score,
        'all_results': pd.DataFrame(results)
    }

# Test the function
# tuning_results = tune_random_forest(X_train, y_train)
Solution 5.1
def tune_random_forest(X_train: pd.DataFrame, y_train: pd.Series,
                       n_estimators_list: List[int] = [50, 100, 200],
                       max_depth_list: List[int] = [3, 5, 7]) -> Dict:
    """
    Tune Random Forest hyperparameters using time series cross-validation.
    """
    tscv = TimeSeriesSplit(n_splits=5)

    best_score = -1
    best_params = {}
    results = []

    for n_est in n_estimators_list:
        for depth in max_depth_list:
            model = RandomForestClassifier(
                n_estimators=n_est,
                max_depth=depth,
                min_samples_leaf=10,
                random_state=42,
                n_jobs=-1
            )

            scores = cross_val_score(model, X_train, y_train, cv=tscv, scoring='accuracy')
            mean_score = scores.mean()

            results.append({
                'n_estimators': n_est,
                'max_depth': depth,
                'mean_cv_score': mean_score,
                'std_cv_score': scores.std()
            })

            if mean_score > best_score:
                best_score = mean_score
                best_params = {'n_estimators': n_est, 'max_depth': depth}

    return {
        'best_params': best_params,
        'best_score': best_score,
        'all_results': pd.DataFrame(results)
    }

Section 3: Gradient Boosting (XGBoost)

Gradient Boosting builds trees sequentially, with each tree correcting the errors of the previous ones.

# Gradient Boosting Concepts

boosting_concepts = """
GRADIENT BOOSTING
=================

Key Idea:
---------
Build trees sequentially, where each tree learns from the errors of all previous trees.

Process:
--------
1. Start with initial prediction (e.g., mean)
2. Calculate residuals (errors)
3. Fit a tree to predict the residuals
4. Update predictions by adding tree * learning_rate
5. Repeat steps 2-4 for n_estimators iterations

XGBoost Advantages:
-------------------
- Regularization: L1 (lasso) and L2 (ridge) to prevent overfitting
- Parallel processing: Faster training
- Built-in cross-validation
- Handles missing values
- Tree pruning: Removes non-essential branches

Key Parameters:
---------------
- n_estimators: Number of boosting rounds
- learning_rate (eta): Step size shrinkage (0.01-0.3 typical)
- max_depth: Maximum tree depth (3-10 typical)
- subsample: Fraction of samples for each tree
- colsample_bytree: Fraction of features for each tree
- reg_alpha: L1 regularization
- reg_lambda: L2 regularization

Random Forest vs XGBoost:
-------------------------
| Aspect        | Random Forest      | XGBoost           |
|---------------|--------------------|--------------------|
| Training      | Parallel (bagging) | Sequential (boost) |
| Trees         | Deep, independent  | Shallow, dependent |
| Overfitting   | Less prone         | More prone         |
| Tuning        | Easier             | More parameters    |
| Performance   | Good baseline      | Often better       |
"""
print(boosting_concepts)
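The five-step process above can be sketched directly for squared-error regression, fitting shallow trees to residuals (synthetic data; a minimal illustration, not XGBoost's actual regularized algorithm):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, (300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, 300)

learning_rate, n_rounds = 0.1, 50

# Step 1: start with a constant prediction (the mean)
pred = np.full_like(y_demo, y_demo.mean())
trees = []

for _ in range(n_rounds):
    residuals = y_demo - pred                      # Step 2: current errors
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X_demo, residuals)                    # Step 3: fit tree to residuals
    pred += learning_rate * tree.predict(X_demo)   # Step 4: shrunken update
    trees.append(tree)                             # Step 5: repeat

mse_start = np.mean((y_demo - y_demo.mean()) ** 2)
mse_final = np.mean((y_demo - pred) ** 2)
print(f"MSE of constant baseline: {mse_start:.4f}")
print(f"MSE after {n_rounds} rounds: {mse_final:.4f}")
```

Each round nudges the ensemble toward the remaining error; the learning rate controls how aggressively, which is why a lower rate usually needs more rounds.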
# Scikit-learn's GradientBoostingClassifier

gbc_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    min_samples_split=20,
    min_samples_leaf=10,
    subsample=0.8,
    random_state=42
)

gbc_model.fit(X_train, y_train)

train_acc = gbc_model.score(X_train, y_train)
test_acc = gbc_model.score(X_test, y_test)

print(f"Gradient Boosting (sklearn) Results:")
print(f"  Train Accuracy: {train_acc:.2%}")
print(f"  Test Accuracy:  {test_acc:.2%}")
# XGBoost implementation

if HAS_XGBOOST:
    xgb_model = xgb.XGBClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        min_child_weight=10,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_alpha=0.1,       # L1 regularization
        reg_lambda=1.0,      # L2 regularization
        random_state=42,
        use_label_encoder=False,  # needed only on older XGBoost (<1.6); newer versions ignore it
        eval_metric='logloss'
    )
    
    xgb_model.fit(X_train, y_train)
    
    train_acc = xgb_model.score(X_train, y_train)
    test_acc = xgb_model.score(X_test, y_test)
    
    print(f"XGBoost Results:")
    print(f"  Train Accuracy: {train_acc:.2%}")
    print(f"  Test Accuracy:  {test_acc:.2%}")
else:
    print("XGBoost not available. Install with: pip install xgboost")
# XGBoost with early stopping

if HAS_XGBOOST:
    # Create validation set for early stopping
    val_idx = int(len(X_train) * 0.8)
    X_train_sub, X_val = X_train[:val_idx], X_train[val_idx:]
    y_train_sub, y_val = y_train[:val_idx], y_train[val_idx:]
    
    xgb_early = xgb.XGBClassifier(
        n_estimators=500,  # High number, will stop early
        learning_rate=0.05,
        max_depth=3,
        min_child_weight=10,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42,
        use_label_encoder=False,
        eval_metric='logloss',
        early_stopping_rounds=20
    )
    
    xgb_early.fit(
        X_train_sub, y_train_sub,
        eval_set=[(X_val, y_val)],
        verbose=False
    )
    
    print(f"Early Stopping Results:")
    print(f"  Best iteration: {xgb_early.best_iteration}")
    print(f"  Test Accuracy:  {xgb_early.score(X_test, y_test):.2%}")
# Exercise 5.2: XGBoost Feature Importance Analyzer (Guided)

def analyze_xgb_importance(model, feature_names: List[str], 
                           importance_type: str = 'gain') -> pd.DataFrame:
    """
    Analyze XGBoost feature importance using different metrics.
    
    importance_type: 'weight', 'gain', or 'cover'
      - weight: Number of times feature is used in trees
      - gain: Average improvement in accuracy when feature is used
      - cover: Average number of samples affected by feature splits
    """
    # TODO: Get importance scores from the model's booster
    importance_dict = model.get_booster().______(importance_type=______)
    
    # Create dataframe with feature names and importance
    importance_df = pd.DataFrame([
        {'feature': f, 'importance': importance_dict.get(f, 0)}
        for f in feature_names
    ])
    
    # TODO: Sort by importance descending
    importance_df = importance_df.______(______, ascending=______)
    
    # Normalize to percentages
    total = importance_df['importance'].sum()
    if total > 0:
        importance_df['pct'] = importance_df['importance'] / total * 100
    
    return importance_df

# Test the function
# if HAS_XGBOOST:
#     importance = analyze_xgb_importance(xgb_model, X.columns.tolist())
Solution 5.2
def analyze_xgb_importance(model, feature_names: List[str], 
                           importance_type: str = 'gain') -> pd.DataFrame:
    """
    Analyze XGBoost feature importance using different metrics.
    """
    importance_dict = model.get_booster().get_score(importance_type=importance_type)

    importance_df = pd.DataFrame([
        {'feature': f, 'importance': importance_dict.get(f, 0)}
        for f in feature_names
    ])

    importance_df = importance_df.sort_values('importance', ascending=False)

    total = importance_df['importance'].sum()
    if total > 0:
        importance_df['pct'] = importance_df['importance'] / total * 100

    return importance_df

Section 4: LightGBM

LightGBM is a highly efficient gradient boosting implementation that uses histogram-based learning.

# LightGBM Concepts

lgbm_concepts = """
LIGHTGBM
========

Key Innovations:
----------------
1. Histogram-based Learning
   - Bins continuous features into discrete buckets
   - Much faster than exact split finding
   - Reduces memory usage

2. Leaf-wise Tree Growth
   - Grows tree by splitting leaf with max gain
   - More complex trees, better accuracy
   - Prone to overfitting (use max_depth limit)

3. Gradient-based One-Side Sampling (GOSS)
   - Keeps samples with large gradients
   - Randomly samples from small gradients
   - Faster training with minimal accuracy loss

XGBoost vs LightGBM:
--------------------
| Aspect        | XGBoost          | LightGBM         |
|---------------|------------------|-------------------|
| Tree Growth   | Level-wise       | Leaf-wise        |
| Speed         | Good             | Faster           |
| Memory        | Higher           | Lower            |
| Categoricals  | Needs encoding   | Native support   |
| Overfitting   | Less prone       | More prone       |

Key Parameters:
---------------
- num_leaves: Max leaves per tree (default 31)
- max_depth: Limit tree depth (-1 = unlimited)
- learning_rate: Step size (0.01-0.3)
- feature_fraction: Features per tree (like colsample_bytree)
- bagging_fraction: Samples per tree (like subsample)
"""
print(lgbm_concepts)
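The histogram idea can be sketched in a few lines (a minimal illustration of the concept, not LightGBM's actual implementation; data and names are my own). Instead of evaluating every raw feature value as a split threshold, bin the feature into at most 255 buckets (LightGBM's default `max_bin`) and search over bin boundaries using cheap per-bin label counts:

```python
import numpy as np

rng = np.random.default_rng(7)
feature = rng.normal(0, 1, 10_000)  # a continuous feature, e.g. daily returns
labels = (feature + rng.normal(0, 1, 10_000) > 0).astype(int)

# Quantile-based bin edges, then assign each sample to a bin index
n_bins = 255
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
bins = np.clip(np.searchsorted(edges, feature, side='right') - 1, 0, n_bins - 1)

# Per-bin label statistics: enough to score any bin-boundary split
pos_per_bin = np.bincount(bins, weights=labels, minlength=n_bins)
cnt_per_bin = np.bincount(bins, minlength=n_bins)

print(f"Candidate splits (exact):     ~{len(np.unique(feature))}")
print(f"Candidate splits (histogram): {n_bins}")
print(f"Samples preserved in bins:    {int(cnt_per_bin.sum())}")
```

Reducing 10,000 candidate thresholds to 255 is where the speed and memory savings come from, at the cost of slightly coarser split points.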
# LightGBM implementation

if HAS_LIGHTGBM:
    lgb_model = lgb.LGBMClassifier(
        n_estimators=100,
        learning_rate=0.1,
        num_leaves=31,
        max_depth=5,
        min_child_samples=20,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_alpha=0.1,
        reg_lambda=1.0,
        random_state=42,
        verbose=-1
    )
    
    lgb_model.fit(X_train, y_train)
    
    train_acc = lgb_model.score(X_train, y_train)
    test_acc = lgb_model.score(X_test, y_test)
    
    print(f"LightGBM Results:")
    print(f"  Train Accuracy: {train_acc:.2%}")
    print(f"  Test Accuracy:  {test_acc:.2%}")
else:
    print("LightGBM not available. Install with: pip install lightgbm")
# Compare all tree-based models

print("\n" + "="*60)
print("TREE-BASED MODEL COMPARISON")
print("="*60)

models = {
    'Decision Tree': dt_model,
    'Random Forest': rf_model,
    'GradientBoosting': gbc_model
}

if HAS_XGBOOST:
    models['XGBoost'] = xgb_model
if HAS_LIGHTGBM:
    models['LightGBM'] = lgb_model

print(f"\n{'Model':<20} {'Train Acc':<12} {'Test Acc':<12} {'Gap':<10}")
print("-" * 54)

for name, model in models.items():
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name:<20} {train_acc:<12.2%} {test_acc:<12.2%} {train_acc-test_acc:<10.2%}")
# Exercise 5.3: LightGBM Trainer with Callbacks (Guided)

def train_lgb_with_callbacks(X_train: pd.DataFrame, y_train: pd.Series,
                              X_val: pd.DataFrame, y_val: pd.Series,
                              params: Dict = None) -> Tuple:
    """
    Train LightGBM with early stopping and logging.
    """
    if not HAS_LIGHTGBM:
        raise ImportError("LightGBM not installed")
    
    default_params = {
        'n_estimators': 500,
        'learning_rate': 0.05,
        'num_leaves': 31,
        'max_depth': 5,
        'random_state': 42,
        'verbose': -1
    }
    
    if params:
        default_params.update(params)
    
    # TODO: Create LightGBM classifier with parameters
    model = lgb.______(**default_params)
    
    # TODO: Fit with evaluation set and early stopping
    model.______(X_train, y_train,
                 eval_set=[(______, ______)],
                 callbacks=[lgb.early_stopping(stopping_rounds=20, verbose=False)])
    
    # Get best iteration
    best_iter = model.best_iteration_
    
    return model, best_iter

# Test the function
# model, best_iter = train_lgb_with_callbacks(X_train_sub, y_train_sub, X_val, y_val)
Solution 5.3
def train_lgb_with_callbacks(X_train: pd.DataFrame, y_train: pd.Series,
                              X_val: pd.DataFrame, y_val: pd.Series,
                              params: Dict = None) -> Tuple:
    """
    Train LightGBM with early stopping and logging.
    """
    if not HAS_LIGHTGBM:
        raise ImportError("LightGBM not installed")

    default_params = {
        'n_estimators': 500,
        'learning_rate': 0.05,
        'num_leaves': 31,
        'max_depth': 5,
        'random_state': 42,
        'verbose': -1
    }

    if params:
        default_params.update(params)

    model = lgb.LGBMClassifier(**default_params)

    model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              callbacks=[lgb.early_stopping(stopping_rounds=20, verbose=False)])

    best_iter = model.best_iteration_

    return model, best_iter

Section 5: Ensemble Methods

Combining multiple models can produce more robust predictions than any single model.

# Ensemble Methods Overview

ensemble_concepts = """
ENSEMBLE METHODS
================

1. VOTING ENSEMBLES
   - Hard Voting: Majority vote from all models
   - Soft Voting: Average probabilities, then predict

2. STACKING
   - Train base models on data
   - Use base model predictions as features for meta-model
   - Meta-model learns to combine predictions optimally

3. BLENDING
   - Similar to stacking but uses holdout set
   - Base models trained on training set
   - Meta-model trained on holdout predictions

Why Ensembles Work:
-------------------
- Different models capture different patterns
- Errors from different models tend to cancel out
- More robust to overfitting
- Reduced variance in predictions

Best Practices:
---------------
- Use diverse base models (different algorithms)
- Each model should be better than random
- Models should make different types of errors
- Keep the ensemble small; each added model increases complexity for diminishing returns
"""
print(ensemble_concepts)
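The "errors tend to cancel out" claim can be checked with a quick simulation on synthetic labels (made up for illustration, not market data): averaging five weak predictors with independent errors beats any one of them.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_models = 10_000, 5

# Synthetic binary labels
y = rng.integers(0, 2, n_samples)

# Each "model" is right 55% of the time, with errors independent of the others
preds = np.array([
    np.where(rng.random(n_samples) < 0.55, y, 1 - y)
    for _ in range(n_models)
])

individual_acc = (preds == y).mean(axis=1)

# Hard (majority) vote across the five models
vote = (preds.mean(axis=0) > 0.5).astype(int)
ensemble_acc = (vote == y).mean()

print(f"Individual accuracies: {np.round(individual_acc, 3)}")
print(f"Majority-vote accuracy: {ensemble_acc:.3f}")
```

With 55%-accurate independent voters, the theoretical majority-vote accuracy is about 59%; the effect disappears when the models make correlated errors, which is why diversity matters.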
# Voting Ensemble

from sklearn.ensemble import VotingClassifier

# Create base estimators
estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)),
    ('gbc', GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42))
]

# Add XGBoost and LightGBM if available
if HAS_XGBOOST:
    estimators.append(('xgb', xgb.XGBClassifier(
        n_estimators=100, max_depth=3, random_state=42,
        use_label_encoder=False, eval_metric='logloss'
    )))

if HAS_LIGHTGBM:
    estimators.append(('lgb', lgb.LGBMClassifier(
        n_estimators=100, max_depth=5, random_state=42, verbose=-1
    )))

# Create voting classifier
voting_clf = VotingClassifier(
    estimators=estimators,
    voting='soft'  # Use predicted probabilities
)

voting_clf.fit(X_train, y_train)

train_acc = voting_clf.score(X_train, y_train)
test_acc = voting_clf.score(X_test, y_test)

print(f"Voting Ensemble Results:")
print(f"  Train Accuracy: {train_acc:.2%}")
print(f"  Test Accuracy:  {test_acc:.2%}")
# Custom weighted ensemble

class WeightedEnsemble:
    """Custom weighted ensemble classifier."""
    
    def __init__(self, models: List, weights: List[float] = None):
        self.models = models
        self.weights = weights or [1/len(models)] * len(models)
        
    def fit(self, X: pd.DataFrame, y: pd.Series):
        """Fit all base models."""
        for model in self.models:
            model.fit(X, y)
        return self
    
    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        """Weighted average of predicted probabilities."""
        probas = np.zeros((len(X), 2))
        
        for model, weight in zip(self.models, self.weights):
            probas += weight * model.predict_proba(X)
            
        return probas / sum(self.weights)
    
    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict class labels."""
        probas = self.predict_proba(X)
        return (probas[:, 1] > 0.5).astype(int)
    
    def score(self, X: pd.DataFrame, y: pd.Series) -> float:
        """Calculate accuracy."""
        return accuracy_score(y, self.predict(X))

# Create weighted ensemble
base_models = [
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
    GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
]

if HAS_XGBOOST:
    base_models.append(xgb.XGBClassifier(
        n_estimators=100, max_depth=3, random_state=42,
        use_label_encoder=False, eval_metric='logloss'
    ))

# Weight models (higher weight for better performers)
weights = [0.3, 0.3, 0.4] if HAS_XGBOOST else [0.5, 0.5]

weighted_ensemble = WeightedEnsemble(base_models, weights)
weighted_ensemble.fit(X_train, y_train)

print(f"\nWeighted Ensemble Results:")
print(f"  Test Accuracy: {weighted_ensemble.score(X_test, y_test):.2%}")
# Exercise 5.4: Complete Tree-Based Classifier System (Open-ended)
#
# Build a TreeBasedClassifier class that:
# - Supports multiple model types: 'dt', 'rf', 'xgb', 'lgb'
# - Has a tune() method for hyperparameter optimization
# - Has a fit() method that trains the selected model
# - Has a predict() and predict_proba() method
# - Has a get_feature_importance() method returning DataFrame
# - Handles missing XGBoost/LightGBM gracefully
#
# Your implementation:
Solution 5.4
class TreeBasedClassifier:
    """Unified interface for tree-based classifiers."""

    SUPPORTED_MODELS = ['dt', 'rf', 'gbc', 'xgb', 'lgb']

    def __init__(self, model_type: str = 'rf', **kwargs):
        if model_type not in self.SUPPORTED_MODELS:
            raise ValueError(f"Unsupported model: {model_type}")

        self.model_type = model_type
        self.params = kwargs
        self.model = None
        self.feature_names = None

    def _create_model(self, params: Dict):
        """Create model instance based on type."""
        if self.model_type == 'dt':
            return DecisionTreeClassifier(**params)
        elif self.model_type == 'rf':
            return RandomForestClassifier(**params)
        elif self.model_type == 'gbc':
            return GradientBoostingClassifier(**params)
        elif self.model_type == 'xgb':
            if not HAS_XGBOOST:
                raise ImportError("XGBoost not installed")
            params.setdefault('use_label_encoder', False)
            params.setdefault('eval_metric', 'logloss')
            return xgb.XGBClassifier(**params)
        elif self.model_type == 'lgb':
            if not HAS_LIGHTGBM:
                raise ImportError("LightGBM not installed")
            params.setdefault('verbose', -1)
            return lgb.LGBMClassifier(**params)

    def tune(self, X: pd.DataFrame, y: pd.Series, 
             param_grid: Dict = None, cv: int = 5) -> Dict:
        """Tune hyperparameters using cross-validation."""
        from sklearn.model_selection import GridSearchCV

        if param_grid is None:
            param_grid = {'max_depth': [3, 5, 7]}
            # Decision trees have no n_estimators parameter, so only tune it for ensembles
            if self.model_type != 'dt':
                param_grid['n_estimators'] = [50, 100]

        base_model = self._create_model(self.params)
        tscv = TimeSeriesSplit(n_splits=cv)

        grid_search = GridSearchCV(
            base_model, param_grid, cv=tscv, scoring='accuracy', n_jobs=-1
        )
        grid_search.fit(X, y)

        self.params.update(grid_search.best_params_)
        return grid_search.best_params_

    def fit(self, X: pd.DataFrame, y: pd.Series):
        """Fit the model."""
        self.feature_names = X.columns.tolist()
        self.model = self._create_model(self.params)
        self.model.fit(X, y)
        return self

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict class labels."""
        return self.model.predict(X)

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        """Predict class probabilities."""
        return self.model.predict_proba(X)

    def get_feature_importance(self) -> pd.DataFrame:
        """Get feature importance as DataFrame."""
        importance = self.model.feature_importances_
        df = pd.DataFrame({
            'feature': self.feature_names,
            'importance': importance
        }).sort_values('importance', ascending=False)
        df['pct'] = df['importance'] / df['importance'].sum() * 100
        return df

    def score(self, X: pd.DataFrame, y: pd.Series) -> float:
        """Calculate accuracy."""
        return accuracy_score(y, self.predict(X))
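The tuning pattern inside `tune()` — GridSearchCV driven by a TimeSeriesSplit so every fold validates on data that follows its training window — can be exercised in isolation. The feature matrix and labels below are synthetic, purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Time-ordered CV: each fold trains on the past and validates on the future
tscv = TimeSeriesSplit(n_splits=3)

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'max_depth': [3, 5], 'n_estimators': [50, 100]},
    cv=tscv,
    scoring='accuracy',
    n_jobs=-1,
)
grid.fit(X, y)

print("Best params:", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.2%}")
```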
# Exercise 5.5: Stacking Ensemble Builder (Open-ended)
#
# Build a StackingEnsemble class that:
# - Takes a list of base models and a meta-model
# - Uses cross-validation to generate base model predictions
# - Trains meta-model on stacked predictions
# - Implements fit(), predict(), and predict_proba()
# - Returns individual model contributions
#
# Your implementation:
Solution 5.5
from sklearn.base import clone

class StackingEnsemble:
    """Stacking ensemble with customizable base and meta models."""

    def __init__(self, base_models: List, meta_model, n_folds: int = 5):
        self.base_models = [clone(m) for m in base_models]
        self.meta_model = clone(meta_model)
        self.n_folds = n_folds
        self.fitted_base_models = []

    def fit(self, X: pd.DataFrame, y: pd.Series):
        """Fit stacking ensemble."""
        X_arr = X.values if isinstance(X, pd.DataFrame) else X
        y_arr = y.values if isinstance(y, pd.Series) else y

        n_samples = len(X_arr)
        n_models = len(self.base_models)

        # Out-of-fold predictions for meta features
        meta_features = np.zeros((n_samples, n_models))

        tscv = TimeSeriesSplit(n_splits=self.n_folds)

        for model_idx, model in enumerate(self.base_models):
            for train_idx, val_idx in tscv.split(X_arr):
                cloned = clone(model)
                cloned.fit(X_arr[train_idx], y_arr[train_idx])

                # Store probability for positive class
                meta_features[val_idx, model_idx] = cloned.predict_proba(X_arr[val_idx])[:, 1]

        # Train meta-model on stacked predictions
        # Use only samples that have meta predictions (after first fold)
        mask = meta_features.sum(axis=1) != 0
        self.meta_model.fit(meta_features[mask], y_arr[mask])

        # Refit base models on full data
        self.fitted_base_models = []
        for model in self.base_models:
            fitted = clone(model)
            fitted.fit(X_arr, y_arr)
            self.fitted_base_models.append(fitted)

        return self

    def _get_meta_features(self, X: pd.DataFrame) -> np.ndarray:
        """Generate meta features from base model predictions."""
        X_arr = X.values if isinstance(X, pd.DataFrame) else X
        meta_features = np.column_stack([
            model.predict_proba(X_arr)[:, 1]
            for model in self.fitted_base_models
        ])
        return meta_features

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict class labels."""
        meta_features = self._get_meta_features(X)
        return self.meta_model.predict(meta_features)

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        """Predict class probabilities."""
        meta_features = self._get_meta_features(X)
        return self.meta_model.predict_proba(meta_features)

    def get_base_contributions(self, X: pd.DataFrame) -> pd.DataFrame:
        """Get predictions from each base model."""
        return pd.DataFrame(
            self._get_meta_features(X),
            columns=[f'model_{i}' for i in range(len(self.fitted_base_models))]
        )

    def score(self, X: pd.DataFrame, y: pd.Series) -> float:
        """Calculate accuracy."""
        return accuracy_score(y, self.predict(X))
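scikit-learn ships its own `StackingClassifier`, and it is worth knowing when the hand-rolled class above is still needed. A minimal sketch on synthetic data (labels and split chosen for illustration only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=400) > 0).astype(int)

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, max_depth=4, random_state=42)),
        ('dt', DecisionTreeClassifier(max_depth=3, random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # integer cv -> stratified folds; see note below
)
stack.fit(X[:300], y[:300])
print(f"Holdout accuracy: {stack.score(X[300:], y[300:]):.2%}")
```

The caveat: `StackingClassifier` builds its meta features with `cross_val_predict`, which requires every sample to land in exactly one validation fold, so it rejects `TimeSeriesSplit` (whose first chunk is never validated on). The custom `StackingEnsemble` above sidesteps this by masking out the samples the time-ordered folds never cover.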
# Exercise 5.6: Model Selection Framework (Open-ended)
#
# Build a TreeModelSelector class that:
# - Automatically trains and compares multiple tree-based models
# - Uses proper time series cross-validation
# - Tracks training time, accuracy, and feature importance
# - Generates a comparison report
# - Recommends the best model based on test performance
# - Provides a plot comparing all models
#
# Your implementation:
Solution 5.6
import time

class TreeModelSelector:
    """Automated tree-based model selection."""

    def __init__(self):
        self.models = {}
        self.results = {}
        self.best_model = None
        self.best_model_name = None

    def _get_default_models(self) -> Dict:
        """Get default set of tree-based models."""
        models = {
            'DecisionTree': DecisionTreeClassifier(max_depth=5, random_state=42),
            'RandomForest': RandomForestClassifier(
                n_estimators=100, max_depth=5, random_state=42, n_jobs=-1
            ),
            'GradientBoosting': GradientBoostingClassifier(
                n_estimators=100, max_depth=3, random_state=42
            )
        }

        if HAS_XGBOOST:
            models['XGBoost'] = xgb.XGBClassifier(
                n_estimators=100, max_depth=3, random_state=42,
                use_label_encoder=False, eval_metric='logloss'
            )

        if HAS_LIGHTGBM:
            models['LightGBM'] = lgb.LGBMClassifier(
                n_estimators=100, max_depth=5, random_state=42, verbose=-1
            )

        return models

    def fit(self, X_train: pd.DataFrame, y_train: pd.Series,
            X_test: pd.DataFrame, y_test: pd.Series,
            custom_models: Dict = None):
        """Train and evaluate all models."""
        self.models = custom_models or self._get_default_models()
        self.feature_names = X_train.columns.tolist()

        for name, model in self.models.items():
            print(f"Training {name}...")

            start_time = time.time()
            model.fit(X_train, y_train)
            train_time = time.time() - start_time

            train_acc = model.score(X_train, y_train)
            test_acc = model.score(X_test, y_test)

            # Get feature importance
            importance = pd.DataFrame({
                'feature': self.feature_names,
                'importance': model.feature_importances_
            }).sort_values('importance', ascending=False)

            self.results[name] = {
                'model': model,
                'train_accuracy': train_acc,
                'test_accuracy': test_acc,
                'overfit_gap': train_acc - test_acc,
                'train_time': train_time,
                'feature_importance': importance
            }

        # Find best model by test accuracy
        self.best_model_name = max(
            self.results.keys(),
            key=lambda x: self.results[x]['test_accuracy']
        )
        self.best_model = self.results[self.best_model_name]['model']

        return self

    def get_comparison_report(self) -> pd.DataFrame:
        """Generate comparison DataFrame."""
        rows = []
        for name, result in self.results.items():
            rows.append({
                'Model': name,
                'Train Acc': f"{result['train_accuracy']:.2%}",
                'Test Acc': f"{result['test_accuracy']:.2%}",
                'Overfit Gap': f"{result['overfit_gap']:.2%}",
                'Train Time (s)': f"{result['train_time']:.2f}"
            })

        # Sort numerically before returning; sorting the formatted strings would be lexicographic
        rows.sort(key=lambda r: float(r['Test Acc'].rstrip('%')), reverse=True)
        return pd.DataFrame(rows)

    def plot_comparison(self):
        """Plot model comparison."""
        fig, axes = plt.subplots(1, 2, figsize=(14, 5))

        # Accuracy comparison
        models = list(self.results.keys())
        train_accs = [self.results[m]['train_accuracy'] for m in models]
        test_accs = [self.results[m]['test_accuracy'] for m in models]

        x = np.arange(len(models))
        width = 0.35

        axes[0].bar(x - width/2, train_accs, width, label='Train', alpha=0.8)
        axes[0].bar(x + width/2, test_accs, width, label='Test', alpha=0.8)
        axes[0].set_ylabel('Accuracy')
        axes[0].set_xticks(x)
        axes[0].set_xticklabels(models, rotation=45, ha='right')
        axes[0].legend()
        axes[0].set_title('Train vs Test Accuracy')

        # Training time
        times = [self.results[m]['train_time'] for m in models]
        axes[1].bar(models, times, color='green', alpha=0.7)
        axes[1].set_ylabel('Training Time (seconds)')
        axes[1].set_xticks(range(len(models)))
        axes[1].set_xticklabels(models, rotation=45, ha='right')
        axes[1].set_title('Training Time')

        plt.tight_layout()
        plt.show()

    def recommend(self) -> str:
        """Return recommendation string."""
        result = self.results[self.best_model_name]
        return (
            f"Recommended: {self.best_model_name}\n"
            f"  Test Accuracy: {result['test_accuracy']:.2%}\n"
            f"  Overfit Gap: {result['overfit_gap']:.2%}\n"
            f"  Top Features: {', '.join(result['feature_importance']['feature'].head(3).tolist())}"
        )

Module Project: Complete Tree-Based Trading Signal System

Build a comprehensive system that uses tree-based models for trading signal generation.

class TreeBasedTradingSystem:
    """
    Complete trading signal system using tree-based models.
    
    Features:
    - Multiple model support (RF, XGBoost, LightGBM)
    - Ensemble predictions
    - Feature importance analysis
    - Signal generation with confidence
    """
    
    def __init__(self, model_type: str = 'ensemble'):
        """
        Initialize trading system.
        
        Args:
            model_type: 'rf', 'xgb', 'lgb', or 'ensemble'
        """
        self.model_type = model_type
        self.model = None
        self.feature_names = None
        self.scaler = StandardScaler()
        
    def create_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create trading features from OHLCV data."""
        features = pd.DataFrame(index=df.index)
        
        # Price features
        features['returns'] = df['Close'].pct_change()
        features['volatility'] = features['returns'].rolling(20).std()
        
        # Momentum
        for period in [5, 10, 20]:
            features[f'momentum_{period}'] = df['Close'].pct_change(period)
        
        # Moving average distances
        for period in [5, 20, 50]:
            ma = df['Close'].rolling(period).mean()
            features[f'dist_ma{period}'] = (df['Close'] - ma) / ma
        
        # RSI
        delta = df['Close'].diff()
        gain = (delta.where(delta > 0, 0)).rolling(14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
        rs = gain / loss
        features['rsi'] = 100 - (100 / (1 + rs))
        
        # Bollinger Band position
        ma20 = df['Close'].rolling(20).mean()
        std20 = df['Close'].rolling(20).std()
        features['bb_position'] = (df['Close'] - ma20) / (2 * std20)
        
        # Volume features
        features['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
        
        return features.dropna()
    
    def _create_model(self):
        """Create model based on type."""
        if self.model_type == 'rf':
            return RandomForestClassifier(
                n_estimators=100, max_depth=5, min_samples_leaf=10,
                random_state=42, n_jobs=-1
            )
        elif self.model_type == 'xgb' and HAS_XGBOOST:
            return xgb.XGBClassifier(
                n_estimators=100, max_depth=3, learning_rate=0.1,
                min_child_weight=10, random_state=42,
                use_label_encoder=False, eval_metric='logloss'
            )
        elif self.model_type == 'lgb' and HAS_LIGHTGBM:
            return lgb.LGBMClassifier(
                n_estimators=100, max_depth=5, learning_rate=0.1,
                min_child_samples=20, random_state=42, verbose=-1
            )
        elif self.model_type == 'ensemble':
            estimators = [
                ('rf', RandomForestClassifier(
                    n_estimators=100, max_depth=5, random_state=42, n_jobs=-1
                )),
                ('gbc', GradientBoostingClassifier(
                    n_estimators=100, max_depth=3, random_state=42
                ))
            ]
            if HAS_XGBOOST:
                estimators.append(('xgb', xgb.XGBClassifier(
                    n_estimators=100, max_depth=3, random_state=42,
                    use_label_encoder=False, eval_metric='logloss'
                )))
            return VotingClassifier(estimators=estimators, voting='soft')
        else:
            return RandomForestClassifier(
                n_estimators=100, max_depth=5, random_state=42, n_jobs=-1
            )
    
    def fit(self, df: pd.DataFrame, test_size: float = 0.2):
        """
        Fit the trading system.
        
        Args:
            df: OHLCV DataFrame
            test_size: Fraction for testing
        """
        # Create features
        features = self.create_features(df)
        
        # Align with original data and create target
        aligned_df = df.loc[features.index]
        target = (aligned_df['Close'].pct_change().shift(-1) > 0).astype(int)
        
        # Remove last row (no target)
        features = features[:-1]
        target = target[:-1]
        
        self.feature_names = features.columns.tolist()
        
        # Split
        split_idx = int(len(features) * (1 - test_size))
        X_train = features[:split_idx]
        X_test = features[split_idx:]
        y_train = target[:split_idx]
        y_test = target[split_idx:]
        
        # Scale features
        X_train_scaled = self.scaler.fit_transform(X_train)
        X_test_scaled = self.scaler.transform(X_test)
        
        # Train model
        self.model = self._create_model()
        self.model.fit(X_train_scaled, y_train)
        
        # Evaluate
        train_acc = self.model.score(X_train_scaled, y_train)
        test_acc = self.model.score(X_test_scaled, y_test)
        
        print(f"\nTraining Complete ({self.model_type})")
        print(f"  Train Accuracy: {train_acc:.2%}")
        print(f"  Test Accuracy:  {test_acc:.2%}")
        
        return self
    
    def predict_signal(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Generate trading signals with confidence.
        
        Returns:
            DataFrame with signal and confidence
        """
        features = self.create_features(df)
        X_scaled = self.scaler.transform(features)
        
        # Get predictions and probabilities
        predictions = self.model.predict(X_scaled)
        probabilities = self.model.predict_proba(X_scaled)
        
        # Create signals DataFrame
        signals = pd.DataFrame(index=features.index)
        signals['signal'] = predictions
        signals['confidence'] = np.max(probabilities, axis=1)
        signals['signal_name'] = signals['signal'].map({0: 'SELL', 1: 'BUY'})
        
        return signals
    
    def get_feature_importance(self) -> pd.DataFrame:
        """Get feature importance (works for non-ensemble models)."""
        if hasattr(self.model, 'feature_importances_'):
            importance = self.model.feature_importances_
        elif hasattr(self.model, 'estimators_'):
            # For VotingClassifier, average importance from tree-based estimators
            importances = []
            for name, est in self.model.named_estimators_.items():
                if hasattr(est, 'feature_importances_'):
                    importances.append(est.feature_importances_)
            importance = np.mean(importances, axis=0)
        else:
            return pd.DataFrame()
        
        return pd.DataFrame({
            'feature': self.feature_names,
            'importance': importance
        }).sort_values('importance', ascending=False)
    
    def backtest_signals(self, df: pd.DataFrame) -> pd.DataFrame:
        """Simple backtest of trading signals."""
        signals = self.predict_signal(df)
        
        # Align returns
        returns = df['Close'].pct_change().shift(-1)
        aligned_returns = returns.loc[signals.index]
        
        # Strategy returns: the signal at time t captures the return from t to t+1
        # (next_return is already forward-shifted, so no additional lag is applied)
        signals['next_return'] = aligned_returns
        signals['strategy_return'] = signals['signal'] * signals['next_return']
        
        # Cumulative returns
        signals['cumulative_strategy'] = (1 + signals['strategy_return'].fillna(0)).cumprod()
        signals['cumulative_bh'] = (1 + signals['next_return'].fillna(0)).cumprod()
        
        return signals
# Test the complete system

# Fetch data
ticker = yf.Ticker("SPY")
data = ticker.history(period="2y")

# Create and train system
trading_system = TreeBasedTradingSystem(model_type='ensemble')
trading_system.fit(data)

# Get feature importance
importance = trading_system.get_feature_importance()
print("\nTop Features:")
print(importance.head(5).to_string(index=False))
# Generate signals and backtest

backtest_results = trading_system.backtest_signals(data)

# Plot cumulative returns
plt.figure(figsize=(12, 6))
plt.plot(backtest_results['cumulative_strategy'].dropna(), label='Strategy', linewidth=2)
plt.plot(backtest_results['cumulative_bh'].dropna(), label='Buy & Hold', linewidth=2, alpha=0.7)
plt.xlabel('Date')
plt.ylabel('Cumulative Return')
plt.title('Tree-Based Trading Strategy Performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate metrics
strategy_return = backtest_results['cumulative_strategy'].iloc[-1] - 1
bh_return = backtest_results['cumulative_bh'].iloc[-1] - 1

print(f"\nPerformance Summary:")
print(f"  Strategy Return: {strategy_return:.2%}")
print(f"  Buy & Hold Return: {bh_return:.2%}")
print(f"  Outperformance: {strategy_return - bh_return:.2%}")

Key Takeaways

  1. Decision Trees are interpretable but prone to overfitting; limit depth and use regularization

  2. Random Forest reduces variance through bagging and feature randomization; a solid baseline for financial ML

  3. XGBoost offers regularization and speed improvements; excellent for structured data

  4. LightGBM is faster with histogram-based learning; watch for overfitting with leaf-wise growth

  5. Ensemble methods (voting, stacking) often outperform single models by combining diverse predictions

  6. Feature importance helps understand what drives predictions and can guide feature engineering

  7. Always use time series cross-validation for financial data to prevent lookahead bias
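Takeaway 7 is concrete enough to demonstrate: `TimeSeriesSplit` always validates on observations that come strictly after the training window, unlike a shuffled K-fold. The 20-point index below is illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered observations

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X))

for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    # Validation indices always start right after the last training index
    print(f"Fold {fold}: train {train_idx.min()}-{train_idx.max()}, "
          f"validate {val_idx.min()}-{val_idx.max()}")
```

Each successive fold grows the training window forward in time, which is exactly the walk-forward behavior a financial backtest needs.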


Next: Module 6 - Other Classification Models (Logistic Regression, SVM, Neural Networks)

Module 6: Other Classification Models

Part 2: Classification Models

Duration Exercises Prerequisites
~2.5 hours 6 Modules 1-5

Learning Objectives

By the end of this module, you will be able to:

- Apply logistic regression for probabilistic trading signals
- Use Support Vector Machines for classification
- Build neural network classifiers with sklearn and keras
- Understand when to use each model type
- Compare model performance on financial data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline

import yfinance as yf

print("Module 6: Other Classification Models")
print("=" * 40)
# Prepare data for classification

def prepare_classification_data(symbol: str = "SPY", period: str = "2y") -> Tuple[pd.DataFrame, pd.Series]:
    """Prepare features and target for classification."""
    
    ticker = yf.Ticker(symbol)
    df = ticker.history(period=period)
    
    # Features
    df['returns'] = df['Close'].pct_change()
    df['volatility'] = df['returns'].rolling(20).std()
    df['momentum_5'] = df['Close'].pct_change(5)
    df['momentum_20'] = df['Close'].pct_change(20)
    
    # Moving averages (separate loop variable so the `period` argument is not shadowed)
    for window in [5, 20, 50]:
        ma = df['Close'].rolling(window).mean()
        df[f'dist_ma{window}'] = (df['Close'] - ma) / ma
    
    # RSI
    delta = df['Close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
    rs = gain / loss
    df['rsi'] = 100 - (100 / (1 + rs))
    
    # Volume
    df['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
    
    # Target
    df['target'] = (df['returns'].shift(-1) > 0).astype(int)
    
    df = df.dropna()
    
    features = ['volatility', 'momentum_5', 'momentum_20', 'dist_ma5',
                'dist_ma20', 'dist_ma50', 'rsi', 'volume_ratio']
    
    return df[features], df['target']

# Load data
X, y = prepare_classification_data()
print(f"Data shape: {X.shape}")
print(f"Target distribution: {y.value_counts().to_dict()}")

# Train/test split (time series)
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Section 1: Logistic Regression

Logistic regression is a linear model for classification that outputs probabilities. Despite its simplicity, it often works well for financial data.

# Logistic Regression Concepts

logreg_concepts = """
LOGISTIC REGRESSION
===================

How It Works:
-------------
1. Compute linear combination: z = w0 + w1*x1 + w2*x2 + ...
2. Apply sigmoid function: P(y=1) = 1 / (1 + e^(-z))
3. Classify: y = 1 if P(y=1) > threshold else 0

Sigmoid Function:
-----------------
           1.0 ─────────────────────
               │              ╱
           0.5 ├─────────────╳
               │        ╱
           0.0 ─────────────────────
              -5    0    5

Key Features:
-------------
- Outputs probabilities (interpretable)
- Coefficients indicate feature importance and direction
- Can be regularized (L1/L2) to prevent overfitting
- Fast to train and predict

Regularization:
---------------
- L1 (Lasso): penalty='l1', leads to sparse coefficients
- L2 (Ridge): penalty='l2', shrinks coefficients (default)
- C parameter: Inverse of regularization strength (smaller = more regularization)

Advantages for Finance:
-----------------------
+ Probabilistic output (good for position sizing)
+ Interpretable coefficients
+ Fast and simple
+ Regularization handles multicollinearity

Limitations:
------------
- Assumes linear decision boundary
- May underfit complex patterns
- Sensitive to outliers
"""
print(logreg_concepts)
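The two-step recipe above (linear score, then sigmoid) can be verified directly against sklearn: computing `sigmoid(X·w + b)` by hand reproduces `predict_proba`. The data below is synthetic, for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Step 1: linear combination z = w0 + w1*x1 + w2*x2 + ...
z = X @ model.coef_[0] + model.intercept_[0]

# Step 2: sigmoid maps the score to P(y=1)
p_manual = 1 / (1 + np.exp(-z))
p_sklearn = model.predict_proba(X)[:, 1]

print("Max difference:", np.abs(p_manual - p_sklearn).max())
```

The two probability vectors agree to floating-point precision, which is why the fitted coefficients can be read directly as log-odds contributions.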
# Basic Logistic Regression

logreg = LogisticRegression(
    penalty='l2',       # L2 regularization
    C=1.0,              # Regularization strength
    solver='lbfgs',     # Optimization algorithm
    max_iter=1000,
    random_state=42
)

logreg.fit(X_train_scaled, y_train)

# Evaluate
train_acc = logreg.score(X_train_scaled, y_train)
test_acc = logreg.score(X_test_scaled, y_test)

print(f"Logistic Regression Results:")
print(f"  Train Accuracy: {train_acc:.2%}")
print(f"  Test Accuracy:  {test_acc:.2%}")
# Analyze coefficients

coef_df = pd.DataFrame({
    'feature': X.columns,
    'coefficient': logreg.coef_[0],
    'abs_coefficient': np.abs(logreg.coef_[0])
}).sort_values('abs_coefficient', ascending=False)

print("\nFeature Coefficients:")
print("(Positive = increases probability of UP, Negative = decreases)")
for _, row in coef_df.iterrows():
    direction = "↑" if row['coefficient'] > 0 else "↓"
    print(f"  {direction} {row['feature']:15s}: {row['coefficient']:+.4f}")

print(f"\nIntercept: {logreg.intercept_[0]:.4f}")
# Visualize coefficients

plt.figure(figsize=(10, 6))
colors = ['green' if c > 0 else 'red' for c in coef_df['coefficient']]
plt.barh(coef_df['feature'], coef_df['coefficient'], color=colors)
plt.axvline(x=0, color='black', linewidth=0.5)
plt.xlabel('Coefficient Value')
plt.title('Logistic Regression Coefficients')
plt.tight_layout()
plt.show()
# Probability predictions

probabilities = logreg.predict_proba(X_test_scaled)

# Show distribution of probabilities
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(probabilities[:, 1], bins=30, edgecolor='black', alpha=0.7)
plt.axvline(x=0.5, color='red', linestyle='--', label='Decision Boundary')
plt.xlabel('P(UP)')
plt.ylabel('Frequency')
plt.title('Distribution of Predicted Probabilities')
plt.legend()

plt.subplot(1, 2, 2)
# Separate by actual class
prob_up = probabilities[y_test == 1, 1]
prob_down = probabilities[y_test == 0, 1]
plt.hist(prob_up, bins=20, alpha=0.6, label='Actual UP', color='green')
plt.hist(prob_down, bins=20, alpha=0.6, label='Actual DOWN', color='red')
plt.xlabel('P(UP)')
plt.ylabel('Frequency')
plt.title('Probabilities by Actual Class')
plt.legend()

plt.tight_layout()
plt.show()
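The concept summary lists probabilistic output as an advantage for position sizing. One simple (hypothetical) rule: scale the position linearly with the distance from 0.5, and stay flat inside a no-trade band where the model is close to 50/50. A sketch:

```python
import numpy as np

def probs_to_positions(p_up: np.ndarray, band: float = 0.05) -> np.ndarray:
    """Map P(UP) in [0, 1] to a position in [-1, 1]; flat when the model is near 50/50."""
    strength = (p_up - 0.5) * 2.0                    # rescale to [-1, 1]
    strength[np.abs(p_up - 0.5) < band] = 0.0        # no-trade band around 0.5
    return np.clip(strength, -1.0, 1.0)

p = np.array([0.52, 0.70, 0.30, 0.50])
print(probs_to_positions(p))   # positions: 0, +0.4, -0.4, 0
```

The band width and the linear mapping are design choices; in practice both would be tuned alongside the model.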
# Exercise 6.1: Regularization Comparison (Guided)

def compare_regularization(X_train: np.ndarray, y_train: pd.Series,
                           X_test: np.ndarray, y_test: pd.Series,
                           C_values: List[float] = [0.001, 0.01, 0.1, 1.0, 10.0]) -> pd.DataFrame:
    """
    Compare logistic regression with different regularization strengths.
    
    Returns:
        DataFrame with C value, train accuracy, test accuracy, and coefficient stats
    """
    results = []
    
    for C in C_values:
        # TODO: Create logistic regression with given C value
        model = ______(
            penalty='l2',
            C=______,
            max_iter=1000,
            random_state=42
        )
        
        # TODO: Fit the model
        model.______(X_train, y_train)
        
        # TODO: Get train and test accuracy
        train_acc = model.______(X_train, y_train)
        test_acc = model.______(X_test, y_test)
        
        # Coefficient statistics
        coef_sum = np.sum(np.abs(model.coef_))
        
        results.append({
            'C': C,
            'train_accuracy': train_acc,
            'test_accuracy': test_acc,
            'coef_sum': coef_sum
        })
    
    return pd.DataFrame(results)

# Test the function
# reg_results = compare_regularization(X_train_scaled, y_train, X_test_scaled, y_test)
Solution 6.1
def compare_regularization(X_train: np.ndarray, y_train: pd.Series,
                           X_test: np.ndarray, y_test: pd.Series,
                           C_values: List[float] = [0.001, 0.01, 0.1, 1.0, 10.0]) -> pd.DataFrame:
    """
    Compare logistic regression with different regularization strengths.
    """
    results = []

    for C in C_values:
        model = LogisticRegression(
            penalty='l2',
            C=C,
            max_iter=1000,
            random_state=42
        )

        model.fit(X_train, y_train)

        train_acc = model.score(X_train, y_train)
        test_acc = model.score(X_test, y_test)

        coef_sum = np.sum(np.abs(model.coef_))

        results.append({
            'C': C,
            'train_accuracy': train_acc,
            'test_accuracy': test_acc,
            'coef_sum': coef_sum
        })

    return pd.DataFrame(results)

Section 2: Support Vector Machines (SVM)

SVMs find the optimal hyperplane that separates classes with maximum margin.

# SVM Concepts

svm_concepts = """
SUPPORT VECTOR MACHINES
=======================

Key Concepts:
-------------
1. Maximum Margin Classifier
   - Finds hyperplane that maximizes distance to nearest points
   - Support vectors: Points closest to the decision boundary

2. Kernel Trick
   - Maps data to higher dimension for non-linear separation
   - Common kernels: linear, poly, rbf (Gaussian), sigmoid

Visualization (2D):
-------------------
                 ○  ○           ← Class 1
              ○  ○
           ─────────────────    ← Decision boundary
         ●  ●
           ●  ●  ●              ← Class 0

Key Parameters:
---------------
- C: Regularization (higher = less regularization)
- kernel: 'linear', 'rbf', 'poly', 'sigmoid'
- gamma: Kernel coefficient for 'rbf' (higher = more complex)

Advantages:
-----------
+ Effective in high dimensions
+ Works well with clear margins
+ Versatile with different kernels

Disadvantages:
--------------
- Slow for large datasets
- Sensitive to feature scaling
- Probability estimates can be unreliable
- Memory intensive
"""
print(svm_concepts)
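The disadvantages above include sensitivity to feature scaling. A quick synthetic demonstration (data and sizes are illustrative): when one feature's variance dwarfs the others, the RBF kernel's distance computation is dominated by it, and accuracy collapses toward chance unless features are standardized first.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 400
x_signal = rng.normal(size=n)              # informative, unit scale
x_noise = rng.normal(size=n) * 1000.0      # pure noise, huge scale
X = np.column_stack([x_signal, x_noise])
y = (x_signal > 0).astype(int)

X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

# Unscaled: RBF distances are dominated by the high-variance noise feature
acc_raw = SVC(kernel='rbf').fit(X_tr, y_tr).score(X_te, y_te)

# Scaled: both features contribute on equal footing
scaler = StandardScaler().fit(X_tr)
acc_scaled = SVC(kernel='rbf').fit(scaler.transform(X_tr), y_tr).score(
    scaler.transform(X_te), y_te)

print(f"Unscaled: {acc_raw:.2%}, Scaled: {acc_scaled:.2%}")
```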
# Linear SVM

svm_linear = SVC(
    kernel='linear',
    C=1.0,
    probability=True,  # Enable probability estimates
    random_state=42
)

svm_linear.fit(X_train_scaled, y_train)

train_acc = svm_linear.score(X_train_scaled, y_train)
test_acc = svm_linear.score(X_test_scaled, y_test)

print(f"Linear SVM Results:")
print(f"  Train Accuracy: {train_acc:.2%}")
print(f"  Test Accuracy:  {test_acc:.2%}")
print(f"  Support Vectors: {len(svm_linear.support_)} / {len(X_train)}")
# RBF (Gaussian) SVM

svm_rbf = SVC(
    kernel='rbf',
    C=1.0,
    gamma='scale',  # 1 / (n_features * X.var())
    probability=True,
    random_state=42
)

svm_rbf.fit(X_train_scaled, y_train)

train_acc = svm_rbf.score(X_train_scaled, y_train)
test_acc = svm_rbf.score(X_test_scaled, y_test)

print(f"RBF SVM Results:")
print(f"  Train Accuracy: {train_acc:.2%}")
print(f"  Test Accuracy:  {test_acc:.2%}")
print(f"  Support Vectors: {len(svm_rbf.support_)} / {len(X_train)}")
# Compare different kernels

kernels = ['linear', 'rbf', 'poly', 'sigmoid']
results = []

for kernel in kernels:
    model = SVC(kernel=kernel, C=1.0, probability=True, random_state=42)
    model.fit(X_train_scaled, y_train)
    
    results.append({
        'kernel': kernel,
        'train_acc': model.score(X_train_scaled, y_train),
        'test_acc': model.score(X_test_scaled, y_test),
        'n_sv': len(model.support_)
    })

kernel_df = pd.DataFrame(results)
print("\nKernel Comparison:")
print(kernel_df.to_string(index=False))
# Exercise 6.2: SVM Hyperparameter Tuner (Guided)

def tune_svm(X_train: np.ndarray, y_train: pd.Series,
             C_values: List[float] = [0.1, 1.0, 10.0],
             gamma_values: List[str] = ['scale', 'auto'],
             cv_folds: int = 5) -> Dict:
    """
    Tune SVM hyperparameters using time series cross-validation.
    
    Returns:
        Dictionary with best parameters and all results
    """
    # TODO: Create time series cross-validator
    tscv = ______(n_splits=cv_folds)
    
    best_score = -1
    best_params = {}
    all_results = []
    
    for C in C_values:
        for gamma in gamma_values:
            # TODO: Create SVC with RBF kernel and current parameters
            model = ______(
                kernel='rbf',
                C=______,
                gamma=______,
                random_state=42
            )
            
            # TODO: Get cross-validation scores
            scores = ______(model, X_train, y_train, cv=tscv, scoring='accuracy')
            mean_score = scores.mean()
            
            all_results.append({
                'C': C,
                'gamma': gamma,
                'mean_score': mean_score,
                'std_score': scores.std()
            })
            
            if mean_score > best_score:
                best_score = mean_score
                best_params = {'C': C, 'gamma': gamma}
    
    return {
        'best_params': best_params,
        'best_score': best_score,
        'all_results': pd.DataFrame(all_results)
    }

# Test the function
# svm_results = tune_svm(X_train_scaled, y_train)
Solution 6.2
def tune_svm(X_train: np.ndarray, y_train: pd.Series,
             C_values: List[float] = [0.1, 1.0, 10.0],
             gamma_values: List[str] = ['scale', 'auto'],
             cv_folds: int = 5) -> Dict:
    """
    Tune SVM hyperparameters using time series cross-validation.
    """
    tscv = TimeSeriesSplit(n_splits=cv_folds)

    best_score = -1
    best_params = {}
    all_results = []

    for C in C_values:
        for gamma in gamma_values:
            model = SVC(
                kernel='rbf',
                C=C,
                gamma=gamma,
                random_state=42
            )

            scores = cross_val_score(model, X_train, y_train, cv=tscv, scoring='accuracy')
            mean_score = scores.mean()

            all_results.append({
                'C': C,
                'gamma': gamma,
                'mean_score': mean_score,
                'std_score': scores.std()
            })

            if mean_score > best_score:
                best_score = mean_score
                best_params = {'C': C, 'gamma': gamma}

    return {
        'best_params': best_params,
        'best_score': best_score,
        'all_results': pd.DataFrame(all_results)
    }

Section 3: Neural Networks (MLP)

Multi-Layer Perceptrons (MLPs) are feedforward neural networks that can learn complex non-linear patterns.

# Neural Network Concepts

nn_concepts = """
NEURAL NETWORKS (MLP)
=====================

Architecture:
-------------
    Input Layer     Hidden Layers      Output Layer
    
       (x1) ──┬──→ (h1) ──┬──→ (h3) ──┬──→ (y)
              │          ╳           │
       (x2) ──┼──→ (h2) ──┼──→ (h4) ──┤
              │          │           │
       (x3) ──┴──→ ...  ─┴──→ ...  ──┘

Key Components:
---------------
1. Neurons: Apply weights, bias, and activation
2. Activation Functions: ReLU, tanh, sigmoid, softmax
3. Backpropagation: Update weights based on error
4. Optimizer: SGD, Adam, etc.

Activation Functions:
---------------------
- ReLU: max(0, x) - most common for hidden layers
- Sigmoid: 1/(1+e^-x) - for binary output
- Softmax: exp(x_i)/sum(exp(x)) - for multi-class
- Tanh: (e^x - e^-x)/(e^x + e^-x)

Key Parameters:
---------------
- hidden_layer_sizes: Tuple, e.g., (100, 50) for 2 layers
- activation: 'relu', 'tanh', 'logistic'
- solver: 'adam', 'sgd', 'lbfgs'
- alpha: L2 regularization
- learning_rate_init: Initial learning rate
- batch_size: Samples per gradient update

Advantages:
-----------
+ Learns complex non-linear patterns
+ Universal approximator
+ Can handle large feature spaces

Disadvantages:
--------------
- Prone to overfitting
- Requires careful tuning
- Black box (less interpretable)
- Needs more data than simpler models
"""
print(nn_concepts)
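The activation formulas listed above translate directly into NumPy; a small reference sketch:

```python
import numpy as np

def relu(x):
    """max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """1 / (1 + e^-x); squashes to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """exp(x_i) / sum(exp(x)); subtract the max first for numerical stability."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))            # negatives clipped to 0
print(sigmoid(0.0))       # 0.5 at the decision boundary
print(softmax(x).sum())   # probabilities sum to 1
```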
# Basic MLP Classifier

mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),  # Two hidden layers
    activation='relu',
    solver='adam',
    alpha=0.001,  # L2 regularization
    batch_size=32,
    learning_rate_init=0.001,
    max_iter=500,
    early_stopping=True,
    validation_fraction=0.1,
    random_state=42
)

mlp.fit(X_train_scaled, y_train)

train_acc = mlp.score(X_train_scaled, y_train)
test_acc = mlp.score(X_test_scaled, y_test)

print(f"MLP Classifier Results:")
print(f"  Train Accuracy: {train_acc:.2%}")
print(f"  Test Accuracy:  {test_acc:.2%}")
print(f"  Iterations: {mlp.n_iter_}")
print(f"  Final Loss: {mlp.loss_:.4f}")
# Training loss curve

plt.figure(figsize=(10, 5))
plt.plot(mlp.loss_curve_, linewidth=2)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('MLP Training Loss Curve')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Compare different architectures

architectures = [
    (32,),           # Shallow
    (64, 32),        # Two layers
    (128, 64, 32),   # Three layers
    (256, 128, 64),  # Wider
]

results = []
for hidden_sizes in architectures:
    model = MLPClassifier(
        hidden_layer_sizes=hidden_sizes,
        activation='relu',
        solver='adam',
        alpha=0.001,
        max_iter=500,
        early_stopping=True,
        random_state=42
    )
    model.fit(X_train_scaled, y_train)
    
    results.append({
        'architecture': str(hidden_sizes),
        'train_acc': model.score(X_train_scaled, y_train),
        'test_acc': model.score(X_test_scaled, y_test),
        'iterations': model.n_iter_
    })

arch_df = pd.DataFrame(results)
print("\nArchitecture Comparison:")
print(arch_df.to_string(index=False))
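When comparing architectures, it helps to know how many trainable parameters each one has: larger networks need more data to avoid overfitting. A sketch (the helper is ours; it assumes a single output unit, as sklearn's `MLPClassifier` uses for binary targets):

```python
def count_mlp_params(n_features: int, hidden_sizes: tuple, n_outputs: int = 1) -> int:
    """Total weights + biases for a fully connected MLP."""
    sizes = [n_features, *hidden_sizes, n_outputs]
    # Each layer contributes (inputs * outputs) weights plus one bias per output
    return sum(sizes[i] * sizes[i + 1] + sizes[i + 1] for i in range(len(sizes) - 1))

for arch in [(32,), (64, 32), (128, 64, 32)]:
    print(arch, count_mlp_params(10, arch))
```

With only 10 input features, (128, 64, 32) already has over 11,000 parameters, which is a lot relative to a few years of daily bars.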
# Exercise 6.3: Neural Network Builder (Guided)

def build_nn_classifier(X_train: np.ndarray, y_train: pd.Series,
                        hidden_sizes: Tuple[int, ...] = (64, 32),
                        dropout_rate: float = 0.2,
                        learning_rate: float = 0.001) -> MLPClassifier:
    """
    Build and train a neural network classifier.
    
    Note: sklearn's MLP doesn't support dropout directly,
    so we use alpha for regularization instead.
    """
    # Approximate dropout effect with alpha
    alpha = dropout_rate * 0.01
    
    # TODO: Create MLP classifier with the given parameters
    model = ______(
        hidden_layer_sizes=______,
        activation='relu',
        solver='adam',
        alpha=______,
        learning_rate_init=______,
        max_iter=500,
        early_stopping=True,
        validation_fraction=0.1,
        random_state=42
    )
    
    # TODO: Fit the model
    model.______(X_train, y_train)
    
    return model

# Test the function
# nn_model = build_nn_classifier(X_train_scaled, y_train)
Solution 6.3
def build_nn_classifier(X_train: np.ndarray, y_train: pd.Series,
                        hidden_sizes: Tuple[int, ...] = (64, 32),
                        dropout_rate: float = 0.2,
                        learning_rate: float = 0.001) -> MLPClassifier:
    """
    Build and train a neural network classifier.
    """
    alpha = dropout_rate * 0.01

    model = MLPClassifier(
        hidden_layer_sizes=hidden_sizes,
        activation='relu',
        solver='adam',
        alpha=alpha,
        learning_rate_init=learning_rate,
        max_iter=500,
        early_stopping=True,
        validation_fraction=0.1,
        random_state=42
    )

    model.fit(X_train, y_train)

    return model

Section 4: Model Comparison

Systematically compare all classification models on trading data.

# Comprehensive model comparison

def get_all_models() -> Dict:
    """Get dictionary of all classification models."""
    return {
        'Logistic (L2)': LogisticRegression(penalty='l2', C=1.0, max_iter=1000, random_state=42),
        'Logistic (L1)': LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=1000, random_state=42),
        'SVM (Linear)': SVC(kernel='linear', C=1.0, probability=True, random_state=42),
        'SVM (RBF)': SVC(kernel='rbf', C=1.0, probability=True, random_state=42),
        'MLP (Small)': MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, early_stopping=True, random_state=42),
        'MLP (Medium)': MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, early_stopping=True, random_state=42),
    }

models = get_all_models()
comparison_results = []

print("Training and evaluating models...\n")

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    
    train_acc = model.score(X_train_scaled, y_train)
    test_acc = model.score(X_test_scaled, y_test)
    
    comparison_results.append({
        'Model': name,
        'Train Acc': train_acc,
        'Test Acc': test_acc,
        'Overfit Gap': train_acc - test_acc
    })

comparison_df = pd.DataFrame(comparison_results).sort_values('Test Acc', ascending=False)
print(comparison_df.to_string(index=False))
# Visualize comparison

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
x = np.arange(len(comparison_df))
width = 0.35

axes[0].bar(x - width/2, comparison_df['Train Acc'], width, label='Train', alpha=0.8)
axes[0].bar(x + width/2, comparison_df['Test Acc'], width, label='Test', alpha=0.8)
axes[0].axhline(y=0.5, color='red', linestyle='--', alpha=0.5, label='Random')
axes[0].set_ylabel('Accuracy')
axes[0].set_xticks(x)
axes[0].set_xticklabels(comparison_df['Model'], rotation=45, ha='right')
axes[0].legend()
axes[0].set_title('Train vs Test Accuracy')

# Overfitting comparison
colors = ['red' if g > 0.05 else 'green' for g in comparison_df['Overfit Gap']]
axes[1].bar(comparison_df['Model'], comparison_df['Overfit Gap'], color=colors, alpha=0.7)
axes[1].axhline(y=0, color='black', linewidth=0.5)
axes[1].set_ylabel('Overfit Gap (Train - Test)')
axes[1].tick_params(axis='x', rotation=45)
axes[1].set_title('Overfitting Analysis')

plt.tight_layout()
plt.show()
# Exercise 6.4: Complete Classifier Comparison System (Open-ended)
#
# Build a ClassifierCompare class that:
# - Takes a dictionary of sklearn classifiers
# - Uses time series cross-validation to evaluate each
# - Tracks accuracy, precision, recall, and F1 score
# - Generates a comparison report DataFrame
# - Provides a plot_comparison() method for visualization
# - Recommends the best model with reasoning
#
# Your implementation:
Solution 6.4
from sklearn.metrics import precision_score, recall_score, f1_score

class ClassifierCompare:
    """Compare multiple classifiers systematically."""

    def __init__(self, classifiers: Dict):
        self.classifiers = classifiers
        self.results = {}
        self.fitted_models = {}

    def evaluate(self, X_train: np.ndarray, y_train: pd.Series,
                 X_test: np.ndarray, y_test: pd.Series,
                 cv_folds: int = 5):
        """Evaluate all classifiers."""
        tscv = TimeSeriesSplit(n_splits=cv_folds)

        for name, clf in self.classifiers.items():
            print(f"Evaluating {name}...")

            # Cross-validation
            cv_scores = cross_val_score(clf, X_train, y_train, cv=tscv, scoring='accuracy')

            # Fit on full training set
            clf.fit(X_train, y_train)
            self.fitted_models[name] = clf

            # Predictions
            y_pred = clf.predict(X_test)

            # Metrics
            self.results[name] = {
                'cv_accuracy_mean': cv_scores.mean(),
                'cv_accuracy_std': cv_scores.std(),
                'train_accuracy': clf.score(X_train, y_train),
                'test_accuracy': clf.score(X_test, y_test),
                'precision': precision_score(y_test, y_pred, zero_division=0),
                'recall': recall_score(y_test, y_pred, zero_division=0),
                'f1': f1_score(y_test, y_pred, zero_division=0)
            }

        return self

    def get_report(self) -> pd.DataFrame:
        """Generate comparison report."""
        rows = []
        for name, metrics in self.results.items():
            rows.append({
                'Model': name,
                'CV Acc': f"{metrics['cv_accuracy_mean']:.2%} +/- {metrics['cv_accuracy_std']:.2%}",
                'Train Acc': f"{metrics['train_accuracy']:.2%}",
                'Test Acc': f"{metrics['test_accuracy']:.2%}",
                'Precision': f"{metrics['precision']:.2%}",
                'Recall': f"{metrics['recall']:.2%}",
                'F1': f"{metrics['f1']:.2%}"
            })
        return pd.DataFrame(rows)

    def plot_comparison(self):
        """Visualize comparison."""
        fig, axes = plt.subplots(1, 2, figsize=(14, 5))

        models = list(self.results.keys())
        test_accs = [self.results[m]['test_accuracy'] for m in models]
        f1_scores = [self.results[m]['f1'] for m in models]

        x = np.arange(len(models))

        axes[0].bar(x, test_accs, color='steelblue', alpha=0.8)
        axes[0].set_ylabel('Test Accuracy')
        axes[0].set_xticks(x)
        axes[0].set_xticklabels(models, rotation=45, ha='right')
        axes[0].axhline(y=0.5, color='red', linestyle='--')
        axes[0].set_title('Test Accuracy Comparison')

        axes[1].bar(x, f1_scores, color='forestgreen', alpha=0.8)
        axes[1].set_ylabel('F1 Score')
        axes[1].set_xticks(x)
        axes[1].set_xticklabels(models, rotation=45, ha='right')
        axes[1].set_title('F1 Score Comparison')

        plt.tight_layout()
        plt.show()

    def recommend(self) -> str:
        """Recommend best model."""
        best_name = max(self.results.keys(),
                       key=lambda x: self.results[x]['test_accuracy'])
        best = self.results[best_name]

        overfit = best['train_accuracy'] - best['test_accuracy']

        reasoning = []
        if best['test_accuracy'] > 0.52:
            reasoning.append("Shows predictive signal above random")
        if overfit < 0.05:
            reasoning.append("Low overfitting gap")
        if best['f1'] > 0.5:
            reasoning.append("Balanced precision/recall")

        return f"""Recommended: {best_name}
Test Accuracy: {best['test_accuracy']:.2%}
F1 Score: {best['f1']:.2%}
Reasoning: {'; '.join(reasoning) if reasoning else 'Best among options'}"""
# Exercise 6.5: Probability Calibration Analyzer (Open-ended)
#
# Build a ProbabilityCalibrator class that:
# - Takes a fitted classifier with predict_proba
# - Analyzes calibration using bins (reliability diagram)
# - Calculates Brier score
# - Implements isotonic or Platt scaling calibration
# - Compares calibrated vs uncalibrated probabilities
#
# Your implementation:
Solution 6.5
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import accuracy_score, brier_score_loss

class ProbabilityCalibrator:
    """Analyze and improve probability calibration."""

    def __init__(self, classifier, n_bins: int = 10):
        self.classifier = classifier
        self.n_bins = n_bins
        self.calibrated_clf = None

    def analyze_calibration(self, X: np.ndarray, y: pd.Series) -> Dict:
        """Analyze probability calibration."""
        probas = self.classifier.predict_proba(X)[:, 1]

        # Calibration curve
        prob_true, prob_pred = calibration_curve(y, probas, n_bins=self.n_bins)

        # Brier score
        brier = brier_score_loss(y, probas)

        return {
            'prob_true': prob_true,
            'prob_pred': prob_pred,
            'brier_score': brier,
            'probabilities': probas
        }

    def calibrate(self, X: np.ndarray, y: pd.Series,
                  method: str = 'isotonic') -> 'ProbabilityCalibrator':
        """Apply calibration to the classifier."""
        self.calibrated_clf = CalibratedClassifierCV(
            self.classifier,
            method=method,  # 'isotonic' or 'sigmoid'
            cv=3
        )
        self.calibrated_clf.fit(X, y)
        return self

    def compare(self, X: np.ndarray, y: pd.Series) -> pd.DataFrame:
        """Compare calibrated vs uncalibrated."""
        uncal_probas = self.classifier.predict_proba(X)[:, 1]
        cal_probas = self.calibrated_clf.predict_proba(X)[:, 1]

        uncal_brier = brier_score_loss(y, uncal_probas)
        cal_brier = brier_score_loss(y, cal_probas)

        uncal_acc = accuracy_score(y, (uncal_probas > 0.5).astype(int))
        cal_acc = accuracy_score(y, (cal_probas > 0.5).astype(int))

        return pd.DataFrame({
            'Metric': ['Brier Score', 'Accuracy'],
            'Uncalibrated': [uncal_brier, uncal_acc],
            'Calibrated': [cal_brier, cal_acc],
            'Improvement': [uncal_brier - cal_brier, cal_acc - uncal_acc]
        })

    def plot_calibration(self, X: np.ndarray, y: pd.Series):
        """Plot calibration curves."""
        uncal = self.analyze_calibration(X, y)

        plt.figure(figsize=(10, 8))

        # Perfect calibration line
        plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')

        # Uncalibrated
        plt.plot(uncal['prob_pred'], uncal['prob_true'],
                's-', label=f"Uncalibrated (Brier: {uncal['brier_score']:.4f})")

        # Calibrated
        if self.calibrated_clf:
            cal_probas = self.calibrated_clf.predict_proba(X)[:, 1]
            cal_true, cal_pred = calibration_curve(y, cal_probas, n_bins=self.n_bins)
            cal_brier = brier_score_loss(y, cal_probas)
            plt.plot(cal_pred, cal_true, 'o-',
                    label=f"Calibrated (Brier: {cal_brier:.4f})")

        plt.xlabel('Mean Predicted Probability')
        plt.ylabel('Fraction of Positives')
        plt.title('Probability Calibration Curve')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.show()
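The Brier score used in the calibrator above is just the mean squared error between predicted probabilities and binary outcomes; computing it by hand and checking against sklearn makes the metric concrete (the toy arrays are made up):

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([1, 0, 1, 1, 0])
p_up = np.array([0.9, 0.2, 0.6, 0.4, 0.1])

# Brier score: mean of (predicted probability - actual outcome)^2; lower is better
brier_manual = np.mean((p_up - y_true) ** 2)
print(f"manual: {brier_manual:.4f}, sklearn: {brier_score_loss(y_true, p_up):.4f}")
```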
# Exercise 6.6: Trading Signal Pipeline Builder (Open-ended)
#
# Build a TradingSignalPipeline class that:
# - Combines preprocessing, feature scaling, and classification
# - Supports multiple classifier backends (logreg, svm, mlp)
# - Generates signals with confidence levels
# - Implements fit(), predict(), and predict_proba()
# - Has a get_signal_strength() method returning -1 to +1
# - Provides interpretability info (coefficients or feature importance)
#
# Your implementation:
Solution 6.6
class TradingSignalPipeline:
    """Complete pipeline for trading signal generation."""

    CLASSIFIERS = {
        'logreg': LogisticRegression(max_iter=1000, random_state=42),
        'svm': SVC(kernel='rbf', probability=True, random_state=42),
        'mlp': MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                            early_stopping=True, random_state=42)
    }

    def __init__(self, classifier_type: str = 'logreg'):
        if classifier_type not in self.CLASSIFIERS:
            raise ValueError(f"Unknown classifier: {classifier_type}")

        self.classifier_type = classifier_type
        self.scaler = StandardScaler()
        self.classifier = self.CLASSIFIERS[classifier_type]
        self.feature_names = None

    def fit(self, X: pd.DataFrame, y: pd.Series):
        """Fit the pipeline."""
        self.feature_names = X.columns.tolist()

        X_scaled = self.scaler.fit_transform(X)
        self.classifier.fit(X_scaled, y)

        return self

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict class labels."""
        X_scaled = self.scaler.transform(X)
        return self.classifier.predict(X_scaled)

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        """Predict class probabilities."""
        X_scaled = self.scaler.transform(X)
        return self.classifier.predict_proba(X_scaled)

    def get_signal_strength(self, X: pd.DataFrame) -> np.ndarray:
        """Get signal strength from -1 (strong sell) to +1 (strong buy)."""
        probas = self.predict_proba(X)
        # Map [0, 1] to [-1, 1]
        return (probas[:, 1] - 0.5) * 2

    def get_signals(self, X: pd.DataFrame) -> pd.DataFrame:
        """Get detailed signal information."""
        predictions = self.predict(X)
        probabilities = self.predict_proba(X)
        strength = self.get_signal_strength(X)

        return pd.DataFrame({
            'signal': predictions,
            'signal_name': np.where(predictions == 1, 'BUY', 'SELL'),
            'prob_down': probabilities[:, 0],
            'prob_up': probabilities[:, 1],
            'strength': strength,
            'confidence': np.abs(strength)
        }, index=X.index)

    def get_interpretability(self) -> Optional[pd.DataFrame]:
        """Get model interpretability info."""
        if self.classifier_type == 'logreg':
            return pd.DataFrame({
                'feature': self.feature_names,
                'coefficient': self.classifier.coef_[0]
            }).sort_values('coefficient', key=abs, ascending=False)
        elif self.classifier_type == 'mlp':
            # Approximate importance from first layer weights
            weights = np.abs(self.classifier.coefs_[0]).mean(axis=1)
            return pd.DataFrame({
                'feature': self.feature_names,
                'importance': weights
            }).sort_values('importance', ascending=False)
        else:
            return None

    def score(self, X: pd.DataFrame, y: pd.Series) -> float:
        """Calculate accuracy."""
        X_scaled = self.scaler.transform(X)
        return self.classifier.score(X_scaled, y)

Module Project: Multi-Model Trading Signal Ensemble

Build a complete trading system that combines multiple classification models.

class MultiModelTradingSystem:
    """
    Trading system combining multiple classification models.
    
    Uses logistic regression, SVM, and neural network for robust predictions.
    """
    
    def __init__(self):
        self.scaler = StandardScaler()
        self.models = {
            'logreg': LogisticRegression(penalty='l2', C=1.0, max_iter=1000, random_state=42),
            'svm': SVC(kernel='rbf', C=1.0, probability=True, random_state=42),
            'mlp': MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                                early_stopping=True, random_state=42)
        }
        self.weights = {'logreg': 0.3, 'svm': 0.3, 'mlp': 0.4}
        self.feature_names = None
        
    def create_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create features from OHLCV data."""
        features = pd.DataFrame(index=df.index)
        
        # Price features
        features['returns'] = df['Close'].pct_change()
        features['volatility'] = features['returns'].rolling(20).std()
        
        # Momentum
        for period in [5, 10, 20]:
            features[f'momentum_{period}'] = df['Close'].pct_change(period)
        
        # Moving average distances
        for period in [5, 20, 50]:
            ma = df['Close'].rolling(period).mean()
            features[f'dist_ma{period}'] = (df['Close'] - ma) / ma
        
        # RSI
        delta = df['Close'].diff()
        gain = (delta.where(delta > 0, 0)).rolling(14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
        features['rsi'] = 100 - (100 / (1 + gain / loss))
        features['rsi_normalized'] = (features['rsi'] - 50) / 50
        
        # Volume
        features['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
        
        return features.dropna()
    
    def fit(self, df: pd.DataFrame, test_size: float = 0.2):
        """Fit all models on the data."""
        # Create features
        features = self.create_features(df)
        self.feature_names = features.columns.tolist()
        
        # Create target
        aligned_df = df.loc[features.index]
        target = (aligned_df['Close'].pct_change().shift(-1) > 0).astype(int)
        
        # Remove last row
        features = features[:-1]
        target = target[:-1]
        
        # Split
        split_idx = int(len(features) * (1 - test_size))
        X_train = features[:split_idx]
        X_test = features[split_idx:]
        y_train = target[:split_idx]
        y_test = target[split_idx:]
        
        # Scale
        X_train_scaled = self.scaler.fit_transform(X_train)
        X_test_scaled = self.scaler.transform(X_test)
        
        # Fit all models
        print("Training models...\n")
        
        for name, model in self.models.items():
            model.fit(X_train_scaled, y_train)
            train_acc = model.score(X_train_scaled, y_train)
            test_acc = model.score(X_test_scaled, y_test)
            print(f"{name:10s}: Train {train_acc:.2%}, Test {test_acc:.2%}")
        
        # Ensemble performance
        ensemble_pred = self._ensemble_predict(X_test_scaled)
        ensemble_acc = accuracy_score(y_test, ensemble_pred)
        print(f"{'Ensemble':10s}: Test {ensemble_acc:.2%}")
        
        return self
    
    def _ensemble_predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Weighted average of probabilities."""
        weighted_proba = np.zeros((len(X), 2))
        
        for name, model in self.models.items():
            weighted_proba += self.weights[name] * model.predict_proba(X)
        
        return weighted_proba
    
    def _ensemble_predict(self, X: np.ndarray) -> np.ndarray:
        """Ensemble prediction."""
        proba = self._ensemble_predict_proba(X)
        return (proba[:, 1] > 0.5).astype(int)
    
    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        """Generate trading signals."""
        features = self.create_features(df)
        X_scaled = self.scaler.transform(features)
        
        # Get individual predictions
        signals = pd.DataFrame(index=features.index)
        
        for name, model in self.models.items():
            signals[f'{name}_prob'] = model.predict_proba(X_scaled)[:, 1]
            signals[f'{name}_signal'] = model.predict(X_scaled)
        
        # Ensemble
        ensemble_proba = self._ensemble_predict_proba(X_scaled)
        signals['ensemble_prob'] = ensemble_proba[:, 1]
        signals['ensemble_signal'] = self._ensemble_predict(X_scaled)
        
        # Signal strength and agreement
        signals['strength'] = (signals['ensemble_prob'] - 0.5) * 2
        signals['model_agreement'] = (
            signals['logreg_signal'] +
            signals['svm_signal'] +
            signals['mlp_signal']
        ) / 3
        
        return signals
    
    def get_feature_importance(self) -> pd.DataFrame:
        """Get feature importance from logistic regression."""
        return pd.DataFrame({
            'feature': self.feature_names,
            'coefficient': self.models['logreg'].coef_[0]
        }).sort_values('coefficient', key=abs, ascending=False)
    
    def backtest(self, df: pd.DataFrame) -> pd.DataFrame:
        """Simple backtest of the system."""
        signals = self.predict(df)
        
        # Get returns
        returns = df['Close'].pct_change().shift(-1)
        aligned_returns = returns.loc[signals.index]
        
        # Strategy returns: next_return is already the forward return at t,
        # so the signal at t lines up with the return it predicts (an extra
        # shift here would lag every trade by one day)
        signals['next_return'] = aligned_returns
        signals['strategy_return'] = signals['ensemble_signal'] * signals['next_return']
        
        # Strength-weighted returns
        signals['weighted_return'] = signals['strength'] * signals['next_return']
        
        # Cumulative
        signals['cum_strategy'] = (1 + signals['strategy_return'].fillna(0)).cumprod()
        signals['cum_weighted'] = (1 + signals['weighted_return'].fillna(0)).cumprod()
        signals['cum_bh'] = (1 + signals['next_return'].fillna(0)).cumprod()
        
        return signals
# Test the multi-model system

# Get data
ticker = yf.Ticker("SPY")
data = ticker.history(period="2y")

# Create and train system
system = MultiModelTradingSystem()
system.fit(data)
# Backtest and visualize

backtest = system.backtest(data)

# Plot results
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Cumulative returns
axes[0].plot(backtest['cum_strategy'].dropna(), label='Binary Strategy', linewidth=2)
axes[0].plot(backtest['cum_weighted'].dropna(), label='Weighted Strategy', linewidth=2)
axes[0].plot(backtest['cum_bh'].dropna(), label='Buy & Hold', linewidth=2, alpha=0.7)
axes[0].set_ylabel('Cumulative Return')
axes[0].set_title('Multi-Model Trading System Performance')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Model agreement
axes[1].plot(backtest['model_agreement'].dropna(), alpha=0.7, linewidth=1)
axes[1].axhline(y=0.5, color='gray', linestyle='--')
axes[1].fill_between(backtest.index, 0, 1, where=backtest['model_agreement'] > 0.66,
                     alpha=0.3, color='green', label='Strong Buy')
axes[1].fill_between(backtest.index, 0, 1, where=backtest['model_agreement'] < 0.33,
                     alpha=0.3, color='red', label='Strong Sell')
axes[1].set_ylabel('Model Agreement (0-1)')
axes[1].set_title('Model Agreement Over Time')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
# Performance summary

strategy_return = backtest['cum_strategy'].iloc[-1] - 1
weighted_return = backtest['cum_weighted'].iloc[-1] - 1
bh_return = backtest['cum_bh'].iloc[-1] - 1

print("\nPerformance Summary:")
print(f"  Binary Strategy: {strategy_return:.2%}")
print(f"  Weighted Strategy: {weighted_return:.2%}")
print(f"  Buy & Hold: {bh_return:.2%}")
print(f"\n  Outperformance (Binary): {strategy_return - bh_return:.2%}")
print(f"  Outperformance (Weighted): {weighted_return - bh_return:.2%}")

# Feature importance
print("\nTop Features (from Logistic Regression):")
importance = system.get_feature_importance()
print(importance.head(5).to_string(index=False))

Key Takeaways

  1. Logistic Regression provides interpretable coefficients and probability outputs; regularization (L1/L2) prevents overfitting

  2. Support Vector Machines find maximum-margin boundaries; kernels (RBF, polynomial) capture non-linear patterns

  3. Neural Networks (MLP) learn complex patterns but require careful tuning and more data to avoid overfitting

  4. Feature scaling is crucial for SVM and neural networks; always scale before training

  5. Probability calibration matters for trading; well-calibrated probabilities improve position sizing

  6. Model ensembles often outperform individual models by combining diverse perspectives

  7. No single best model exists; the right choice depends on data characteristics and interpretability needs
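
Takeaway 5 (probability calibration) can be demonstrated with a minimal sketch on synthetic data. The estimator choice, `sigmoid` method, and data here are illustrative assumptions, not part of the course code; scikit-learn's `CalibratedClassifierCV` refits the model with Platt (sigmoid) or isotonic calibration via internal cross-validation.

```python
# Minimal calibration sketch on synthetic data (assumed setup, not course data).
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 5))
# Target depends weakly on the first feature, mimicking a noisy price signal
y = (X[:, 0] + rng.normal(scale=2.0, size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, shuffle=False)

raw = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
cal = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    method='sigmoid', cv=3,
).fit(X_tr, y_tr)

# Brier score = mean squared error of predicted probabilities (lower is better)
print(f"Raw Brier:        {brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1]):.4f}")
print(f"Calibrated Brier: {brier_score_loss(y_te, cal.predict_proba(X_te)[:, 1]):.4f}")
```

A lower Brier score after calibration suggests the probabilities are closer to true frequencies, which is what position sizing depends on.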


Next: Module 7 - Model Evaluation (Classification metrics, financial metrics, ROC curves)

Module 7: Model Evaluation

Part 2: Classification Models

Duration Exercises Prerequisites
~2.5 hours 6 Modules 1-6

Learning Objectives

By the end of this module, you will be able to:

- Calculate and interpret classification metrics (accuracy, precision, recall, F1)
- Use confusion matrices for detailed error analysis
- Apply ROC curves and AUC for threshold-independent evaluation
- Evaluate models with financial metrics (returns, Sharpe ratio)
- Implement proper walk-forward validation for realistic assessment

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    roc_curve, auc, roc_auc_score,
    precision_recall_curve, average_precision_score
)
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

import yfinance as yf

print("Module 7: Model Evaluation")
print("=" * 40)
# Prepare data and train a model for evaluation

def prepare_data(symbol: str = "SPY", period: str = "2y") -> Tuple:
    """Prepare features, target, and train/test splits."""
    ticker = yf.Ticker(symbol)
    df = ticker.history(period=period)
    
    # Features
    df['returns'] = df['Close'].pct_change()
    df['volatility'] = df['returns'].rolling(20).std()
    df['momentum_5'] = df['Close'].pct_change(5)
    df['momentum_20'] = df['Close'].pct_change(20)
    
    for period_len in [5, 20, 50]:
        ma = df['Close'].rolling(period_len).mean()
        df[f'dist_ma{period_len}'] = (df['Close'] - ma) / ma
    
    delta = df['Close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
    df['rsi'] = 100 - (100 / (1 + gain / loss))
    df['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
    
    df['target'] = (df['returns'].shift(-1) > 0).astype(int)
    df = df.dropna()
    
    features = ['volatility', 'momentum_5', 'momentum_20', 'dist_ma5',
                'dist_ma20', 'dist_ma50', 'rsi', 'volume_ratio']
    
    X = df[features]
    y = df['target']
    
    split_idx = int(len(X) * 0.8)
    X_train, X_test = X[:split_idx], X[split_idx:]
    y_train, y_test = y[:split_idx], y[split_idx:]
    
    return X_train, X_test, y_train, y_test, df

# Load data
X_train, X_test, y_train, y_test, df = prepare_data()

# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train_scaled, y_train)

# Get predictions
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)

print(f"Test samples: {len(y_test)}")
print(f"Predictions shape: {y_pred.shape}")

Section 1: Classification Metrics

Understanding the fundamental metrics for evaluating classification models.

# Classification Metrics Overview

metrics_overview = """
CLASSIFICATION METRICS
======================

Confusion Matrix:
-----------------
                    Predicted
                  Neg     Pos
           Neg  [ TN  |  FP ]
  Actual   Pos  [ FN  |  TP ]

  TN = True Negative:  Correctly predicted DOWN
  TP = True Positive:  Correctly predicted UP
  FP = False Positive: Predicted UP, was DOWN (Type I error)
  FN = False Negative: Predicted DOWN, was UP (Type II error)

Key Metrics:
------------
  Accuracy = (TP + TN) / (TP + TN + FP + FN)
    → Overall correctness
    → Misleading for imbalanced data

  Precision = TP / (TP + FP)
    → "Of all predicted UP, how many were actually UP?"
    → High precision = few false alarms
    → Important when FP is costly (e.g., buying on wrong signal)

  Recall = TP / (TP + FN)
    → "Of all actual UP days, how many did we catch?"
    → High recall = don't miss opportunities
    → Important when FN is costly (e.g., missing big moves)

  F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
    → Harmonic mean of precision and recall
    → Balanced measure when both matter

Trading Context:
----------------
  High Precision Strategy: "Only trade when very confident"
  High Recall Strategy: "Never miss a move"
  Balanced (F1): "Reasonable trade-off"
"""
print(metrics_overview)
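
The formulas above can be verified by hand on a toy set of predictions. This sketch (the labels are made up for illustration) computes precision, recall, and F1 directly from the confusion-matrix counts and cross-checks them against scikit-learn.

```python
# Verify precision/recall/F1 formulas by hand, then cross-check with sklearn.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_hat  = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)                                  # of predicted UP, fraction correct
recall    = tp / (tp + fn)                                  # of actual UP, fraction caught
f1        = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"By hand: P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
print(f"sklearn: P={precision_score(y_true, y_hat):.3f} "
      f"R={recall_score(y_true, y_hat):.3f} F1={f1_score(y_true, y_hat):.3f}")
```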
# Calculate basic metrics

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Classification Metrics:")
print(f"  Accuracy:  {accuracy:.4f} ({accuracy:.2%})")
print(f"  Precision: {precision:.4f} ({precision:.2%})")
print(f"  Recall:    {recall:.4f} ({recall:.2%})")
print(f"  F1 Score:  {f1:.4f} ({f1:.2%})")

# Baselines: 50% for a coin flip, plus the majority-class rate
# (always predicting the more common class can beat 50% on imbalanced data)
majority_baseline = max(y_test.mean(), 1 - y_test.mean())
print(f"\n  Random Baseline: 50.00%")
print(f"  Majority-Class Baseline: {majority_baseline:.2%}")
print(f"  Improvement over random: {(accuracy - 0.5) * 100:.2f}pp")
# Full classification report

print("\nClassification Report:")
print("=" * 60)
print(classification_report(y_test, y_pred, target_names=['DOWN', 'UP']))
# Confusion matrix

cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted DOWN', 'Predicted UP'],
            yticklabels=['Actual DOWN', 'Actual UP'])
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()

# Extract values
tn, fp, fn, tp = cm.ravel()
print(f"\nConfusion Matrix Breakdown:")
print(f"  True Negatives (DOWN → DOWN):  {tn}")
print(f"  False Positives (DOWN → UP):   {fp}  ← Wrong buy signals")
print(f"  False Negatives (UP → DOWN):   {fn}  ← Missed opportunities")
print(f"  True Positives (UP → UP):      {tp}")
# Exercise 7.1: Metrics Calculator (Guided)

def calculate_all_metrics(y_true: pd.Series, y_pred: np.ndarray,
                          y_proba: np.ndarray = None) -> Dict:
    """
    Calculate comprehensive classification metrics.
    
    Returns:
        Dictionary with all metrics
    """
    # Basic metrics
    metrics = {}
    
    # TODO: Calculate accuracy, precision, recall, f1
    metrics['accuracy'] = ______(y_true, y_pred)
    metrics['precision'] = ______(y_true, y_pred)
    metrics['recall'] = ______(y_true, y_pred)
    metrics['f1'] = ______(y_true, y_pred)
    
    # TODO: Get confusion matrix values
    cm = ______(y_true, y_pred)
    tn, fp, fn, tp = cm.______()
    
    metrics['true_negatives'] = tn
    metrics['false_positives'] = fp
    metrics['false_negatives'] = fn
    metrics['true_positives'] = tp
    
    # Specificity (true negative rate)
    metrics['specificity'] = tn / (tn + fp) if (tn + fp) > 0 else 0
    
    # ROC AUC if probabilities provided
    if y_proba is not None:
        metrics['roc_auc'] = roc_auc_score(y_true, y_proba[:, 1])
    
    return metrics

# Test the function
# metrics = calculate_all_metrics(y_test, y_pred, y_proba)
Solution 7.1
def calculate_all_metrics(y_true: pd.Series, y_pred: np.ndarray,
                          y_proba: np.ndarray = None) -> Dict:
    """
    Calculate comprehensive classification metrics.
    """
    metrics = {}

    metrics['accuracy'] = accuracy_score(y_true, y_pred)
    metrics['precision'] = precision_score(y_true, y_pred)
    metrics['recall'] = recall_score(y_true, y_pred)
    metrics['f1'] = f1_score(y_true, y_pred)

    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()

    metrics['true_negatives'] = tn
    metrics['false_positives'] = fp
    metrics['false_negatives'] = fn
    metrics['true_positives'] = tp

    metrics['specificity'] = tn / (tn + fp) if (tn + fp) > 0 else 0

    if y_proba is not None:
        metrics['roc_auc'] = roc_auc_score(y_true, y_proba[:, 1])

    return metrics

Section 2: ROC Curves and AUC

ROC curves provide threshold-independent model evaluation.

# ROC Curve Concepts

roc_concepts = """
ROC CURVES AND AUC
==================

What is ROC?
------------
ROC = Receiver Operating Characteristic
- Plots True Positive Rate vs False Positive Rate at different thresholds
- Shows trade-off between catching positives and creating false alarms

Axes:
-----
  Y-axis: True Positive Rate (TPR) = Recall = TP / (TP + FN)
  X-axis: False Positive Rate (FPR) = FP / (FP + TN)

Interpretation:
---------------
  - Diagonal line = random classifier (AUC = 0.5)
  - Upper left corner = perfect classifier (AUC = 1.0)
  - Curve above diagonal = better than random

AUC (Area Under Curve):
-----------------------
  - 1.0: Perfect classifier
  - 0.9-1.0: Excellent
  - 0.8-0.9: Good
  - 0.7-0.8: Fair
  - 0.5-0.7: Poor
  - 0.5: Random

Trading Context:
----------------
  AUC 0.5: Model has no predictive power
  AUC 0.55: Slight edge (may be profitable with good execution)
  AUC 0.60+: Strong signal (rare in liquid markets)
"""
print(roc_concepts)
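
AUC also has a useful ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. This sketch (synthetic labels and scores, assumed for illustration) checks that pairwise-rank computation against `roc_auc_score`.

```python
# AUC as a ranking probability: P(score of random positive > score of random negative).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
# Scores weakly correlated with the label (a "slight edge" model)
scores = y * 0.3 + rng.normal(size=500)

pos, neg = scores[y == 1], scores[y == 0]
# Pairwise win rate of positives over negatives (ties count half)
pairs = pos[:, None] - neg[None, :]
auc_rank = (np.sum(pairs > 0) + 0.5 * np.sum(pairs == 0)) / pairs.size

print(f"Rank-based AUC: {auc_rank:.4f}")
print(f"roc_auc_score:  {roc_auc_score(y, scores):.4f}")
```

The two numbers agree exactly; this is the Mann-Whitney U relationship behind AUC.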
# Calculate and plot ROC curve

fpr, tpr, thresholds = roc_curve(y_test, y_proba[:, 1])
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(10, 8))

# ROC curve
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label=f'ROC curve (AUC = {roc_auc:.4f})')

# Random baseline
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--',
         label='Random (AUC = 0.500)')

# Formatting
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"AUC: {roc_auc:.4f}")
print(f"Interpretation: {'Better than random' if roc_auc > 0.5 else 'No predictive power'}")
# Optimal threshold selection

# Youden's J statistic: TPR - FPR
j_scores = tpr - fpr
optimal_idx = np.argmax(j_scores)
optimal_threshold = thresholds[optimal_idx]

print(f"Optimal Threshold (Youden's J): {optimal_threshold:.4f}")
print(f"  At this threshold:")
print(f"  TPR (Recall): {tpr[optimal_idx]:.4f}")
print(f"  FPR: {fpr[optimal_idx]:.4f}")

# Apply optimal threshold
y_pred_optimal = (y_proba[:, 1] >= optimal_threshold).astype(int)
print(f"\nWith optimal threshold:")
print(f"  Accuracy: {accuracy_score(y_test, y_pred_optimal):.4f}")
print(f"  Precision: {precision_score(y_test, y_pred_optimal):.4f}")
print(f"  Recall: {recall_score(y_test, y_pred_optimal):.4f}")
# Precision-Recall Curve (better for imbalanced data)

precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_test, y_proba[:, 1])
avg_precision = average_precision_score(y_test, y_proba[:, 1])

plt.figure(figsize=(10, 8))

plt.plot(recall_curve, precision_curve, color='blue', lw=2,
         label=f'PR curve (AP = {avg_precision:.4f})')

# Baseline (proportion of positive class)
baseline = y_test.mean()
plt.axhline(y=baseline, color='gray', linestyle='--',
            label=f'Random baseline ({baseline:.4f})')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='lower left')
plt.grid(True, alpha=0.3)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.tight_layout()
plt.show()
# Exercise 7.2: ROC Analyzer (Guided)

def analyze_roc(y_true: pd.Series, y_proba: np.ndarray,
                target_fpr: float = 0.1) -> Dict:
    """
    Comprehensive ROC analysis.
    
    Args:
        y_true: True labels
        y_proba: Predicted probabilities (n_samples, 2)
        target_fpr: Target false positive rate for threshold
    
    Returns:
        Dictionary with ROC analysis results
    """
    # TODO: Calculate ROC curve
    fpr, tpr, thresholds = ______(y_true, y_proba[:, 1])
    
    # TODO: Calculate AUC
    roc_auc = ______(fpr, tpr)
    
    # Youden's optimal threshold
    j_scores = tpr - fpr
    optimal_idx = np.______(j_scores)
    optimal_threshold = thresholds[optimal_idx]
    
    # Threshold for target FPR
    target_idx = np.argmin(np.abs(fpr - target_fpr))
    target_threshold = thresholds[target_idx]
    
    return {
        'auc': roc_auc,
        'optimal_threshold': optimal_threshold,
        'optimal_tpr': tpr[optimal_idx],
        'optimal_fpr': fpr[optimal_idx],
        'target_threshold': target_threshold,
        'target_tpr': tpr[target_idx],
        'fpr': fpr,
        'tpr': tpr,
        'thresholds': thresholds
    }

# Test the function
# roc_analysis = analyze_roc(y_test, y_proba)
Solution 7.2
def analyze_roc(y_true: pd.Series, y_proba: np.ndarray,
                target_fpr: float = 0.1) -> Dict:
    """
    Comprehensive ROC analysis.
    """
    fpr, tpr, thresholds = roc_curve(y_true, y_proba[:, 1])

    roc_auc = auc(fpr, tpr)

    j_scores = tpr - fpr
    optimal_idx = np.argmax(j_scores)
    optimal_threshold = thresholds[optimal_idx]

    target_idx = np.argmin(np.abs(fpr - target_fpr))
    target_threshold = thresholds[target_idx]

    return {
        'auc': roc_auc,
        'optimal_threshold': optimal_threshold,
        'optimal_tpr': tpr[optimal_idx],
        'optimal_fpr': fpr[optimal_idx],
        'target_threshold': target_threshold,
        'target_tpr': tpr[target_idx],
        'fpr': fpr,
        'tpr': tpr,
        'thresholds': thresholds
    }

Section 3: Financial Metrics

ML metrics don't tell the whole story; we also need financial performance metrics.

# Financial Metrics Overview

financial_metrics = """
FINANCIAL METRICS FOR ML MODELS
================================

Why Financial Metrics Matter:
-----------------------------
- High accuracy doesn't mean profits
- Correct predictions on small moves vs wrong on big moves
- Transaction costs and slippage
- Risk-adjusted returns matter

Key Financial Metrics:
----------------------
1. Total Return
   Strategy return vs buy-and-hold

2. Sharpe Ratio
   (Return - Risk Free) / Volatility
   > 1.0 is good, > 2.0 is excellent

3. Maximum Drawdown
   Largest peak-to-trough decline
   Lower is better

4. Win Rate
   Percentage of profitable trades
   (Different from ML accuracy!)

5. Profit Factor
   Gross Profit / Gross Loss
   > 1.0 means profitable

6. Average Win/Loss Ratio
   Average winning trade / Average losing trade

7. Calmar Ratio
   Annual Return / Max Drawdown

Accuracy vs Profitability:
--------------------------
  Model A: 60% accuracy, predicts small moves correctly
  Model B: 45% accuracy, predicts big moves correctly
  
  Model B can be MORE profitable!
"""
print(financial_metrics)
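
The Model A vs Model B claim above can be made concrete with back-of-the-envelope numbers. The move sizes and hit rates here are illustrative assumptions: 90 small-move days of ±0.2% and 10 big-move days of ±3%.

```python
# Toy illustration: lower accuracy can still mean higher P&L if the model
# is right on the large moves. All numbers below are assumed for illustration.
n_small, n_big = 90, 10          # days with small vs big moves
r_small, r_big = 0.002, 0.03     # absolute move sizes

# Model A: 62% right on small moves, 50% (coin flip) on big moves
acc_a = (0.62 * n_small + 0.50 * n_big) / (n_small + n_big)
pnl_a = (0.62 - 0.38) * n_small * r_small + (0.50 - 0.50) * n_big * r_big

# Model B: 41% right on small moves, 80% right on big moves
acc_b = (0.41 * n_small + 0.80 * n_big) / (n_small + n_big)
pnl_b = (0.41 - 0.59) * n_small * r_small + (0.80 - 0.20) * n_big * r_big

print(f"Model A: accuracy {acc_a:.0%}, expected P&L {pnl_a:+.4f}")
print(f"Model B: accuracy {acc_b:.0%}, expected P&L {pnl_b:+.4f}")
```

Model A wins on accuracy (~61% vs ~45%) while Model B earns several times more, because its edge is concentrated where the money is.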
# Calculate financial metrics for our model

def calculate_financial_metrics(y_true: pd.Series, y_pred: np.ndarray,
                                 returns: pd.Series, risk_free: float = 0.02) -> Dict:
    """
    Calculate financial performance metrics.
    
    Args:
        y_true: True labels
        y_pred: Predicted labels
        returns: Actual returns series (aligned with predictions)
        risk_free: Annual risk-free rate
    """
    # Align data
    pred_series = pd.Series(y_pred, index=y_true.index)
    aligned_returns = returns.loc[y_true.index]
    
    # Strategy returns (long when predicted up, flat when predicted down)
    strategy_returns = pred_series.shift(1) * aligned_returns
    strategy_returns = strategy_returns.dropna()
    
    # Buy and hold returns
    bh_returns = aligned_returns
    
    # Cumulative returns
    cum_strategy = (1 + strategy_returns).cumprod().iloc[-1] - 1
    cum_bh = (1 + bh_returns).cumprod().iloc[-1] - 1
    
    # Sharpe Ratio (annualized)
    daily_rf = (1 + risk_free) ** (1/252) - 1
    excess_returns = strategy_returns - daily_rf
    sharpe = np.sqrt(252) * excess_returns.mean() / excess_returns.std()
    
    # Maximum Drawdown
    cum_returns = (1 + strategy_returns).cumprod()
    running_max = cum_returns.expanding().max()
    drawdown = (cum_returns - running_max) / running_max
    max_drawdown = drawdown.min()
    
    # Win Rate (on actual trades)
    trades = strategy_returns[pred_series.shift(1) == 1]
    win_rate = (trades > 0).mean() if len(trades) > 0 else 0
    
    # Profit Factor
    gains = trades[trades > 0].sum()
    losses = abs(trades[trades < 0].sum())
    profit_factor = gains / losses if losses > 0 else np.inf
    
    return {
        'total_return': cum_strategy,
        'bh_return': cum_bh,
        'outperformance': cum_strategy - cum_bh,
        'sharpe_ratio': sharpe,
        'max_drawdown': max_drawdown,
        'win_rate': win_rate,
        'profit_factor': profit_factor,
        'n_trades': len(trades)
    }

# Calculate (use realized daily returns here; the shift(1) inside the function
# lags the signal one day, so each prediction is applied to the return it forecast)
returns = df['Close'].pct_change()
test_returns = returns.loc[y_test.index]

financial = calculate_financial_metrics(y_test, y_pred, test_returns)

print("Financial Performance Metrics:")
print(f"  Total Return: {financial['total_return']:.2%}")
print(f"  Buy & Hold:   {financial['bh_return']:.2%}")
print(f"  Outperformance: {financial['outperformance']:.2%}")
print(f"\n  Sharpe Ratio: {financial['sharpe_ratio']:.2f}")
print(f"  Max Drawdown: {financial['max_drawdown']:.2%}")
print(f"\n  Win Rate: {financial['win_rate']:.2%}")
print(f"  Profit Factor: {financial['profit_factor']:.2f}")
print(f"  Number of Trades: {financial['n_trades']}")
# Visualize strategy performance

# Calculate cumulative returns
pred_series = pd.Series(y_pred, index=y_test.index)
strategy_returns = pred_series.shift(1) * test_returns
strategy_returns = strategy_returns.dropna()

cum_strategy = (1 + strategy_returns).cumprod()
cum_bh = (1 + test_returns.loc[strategy_returns.index]).cumprod()

fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Cumulative returns
axes[0].plot(cum_strategy.index, cum_strategy, label='Strategy', linewidth=2)
axes[0].plot(cum_bh.index, cum_bh, label='Buy & Hold', linewidth=2, alpha=0.7)
axes[0].set_ylabel('Cumulative Return')
axes[0].set_title('Strategy vs Buy & Hold Performance')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Drawdown
running_max = cum_strategy.expanding().max()
drawdown = (cum_strategy - running_max) / running_max * 100

axes[1].fill_between(drawdown.index, drawdown, 0, color='red', alpha=0.3)
axes[1].plot(drawdown.index, drawdown, color='red', linewidth=1)
axes[1].set_ylabel('Drawdown (%)')
axes[1].set_title('Strategy Drawdown')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
# Exercise 7.3: Complete Financial Evaluator (Guided)

def evaluate_trading_strategy(predictions: np.ndarray, returns: pd.Series,
                               index: pd.DatetimeIndex) -> pd.DataFrame:
    """
    Evaluate trading strategy with comprehensive metrics.
    
    Returns:
        DataFrame with daily performance and summary statistics
    """
    # Create aligned series
    pred_series = pd.Series(predictions, index=index)
    aligned_returns = returns.loc[index]
    
    # TODO: Calculate strategy returns
    strategy_returns = pred_series.______(1) * aligned_returns
    
    # Build results DataFrame
    results = pd.DataFrame(index=index)
    results['prediction'] = pred_series
    results['actual_return'] = aligned_returns
    results['strategy_return'] = strategy_returns
    
    # TODO: Calculate cumulative returns
    results['cum_strategy'] = (1 + results['strategy_return'].fillna(0)).______()
    results['cum_bh'] = (1 + results['actual_return'].fillna(0)).______()
    
    # Drawdown calculation
    running_max = results['cum_strategy'].expanding().max()
    results['drawdown'] = (results['cum_strategy'] - running_max) / running_max
    
    return results

# Test the function
# eval_results = evaluate_trading_strategy(y_pred, test_returns, y_test.index)
Solution 7.3
def evaluate_trading_strategy(predictions: np.ndarray, returns: pd.Series,
                               index: pd.DatetimeIndex) -> pd.DataFrame:
    """
    Evaluate trading strategy with comprehensive metrics.
    """
    pred_series = pd.Series(predictions, index=index)
    aligned_returns = returns.loc[index]

    strategy_returns = pred_series.shift(1) * aligned_returns

    results = pd.DataFrame(index=index)
    results['prediction'] = pred_series
    results['actual_return'] = aligned_returns
    results['strategy_return'] = strategy_returns

    results['cum_strategy'] = (1 + results['strategy_return'].fillna(0)).cumprod()
    results['cum_bh'] = (1 + results['actual_return'].fillna(0)).cumprod()

    running_max = results['cum_strategy'].expanding().max()
    results['drawdown'] = (results['cum_strategy'] - running_max) / running_max

    return results

Section 4: Walk-Forward Validation

Proper time-series validation for realistic performance assessment.

# Walk-Forward Validation Concepts

wf_concepts = """
WALK-FORWARD VALIDATION
=======================

Why Walk-Forward?
-----------------
- Standard k-fold CV uses future data to predict past (leakage!)
- Time series requires respecting temporal order
- Simulates real trading: train on past, predict future

Types of Time Series CV:
------------------------

1. Expanding Window:
   Train: [====]        → Test: [=]
   Train: [=====]       → Test: [=]
   Train: [======]      → Test: [=]

2. Rolling Window (Fixed):
   Train: [====]        → Test: [=]
         Train: [====]  → Test: [=]
              Train: [====] → Test: [=]

3. Purging & Embargo:
   Train: [====]___Gap___Test: [=]
   - Purge: Remove training data too close to test
   - Embargo: Gap between train and test
   - Prevents label leakage in overlapping features

Walk-Forward Process:
---------------------
1. Define initial training window
2. Train model on training window
3. Make predictions on test window
4. Roll forward by step size
5. Repeat until end of data
6. Aggregate all out-of-sample predictions
"""
print(wf_concepts)
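
The "Expanding Window" scheme above is exactly what scikit-learn's `TimeSeriesSplit` (imported earlier in this module) implements out of the box: each fold trains on everything before the test window and never on anything after it. A short sketch on a toy index, with `test_size` chosen for readability:

```python
# Expanding-window splits with sklearn's TimeSeriesSplit (toy 24-day index).
# The optional gap parameter inserts an embargo between train and test windows.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(24).reshape(-1, 1)  # 24 pretend trading days

tscv = TimeSeriesSplit(n_splits=4, test_size=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_demo)):
    print(f"Fold {fold}: train [{train_idx[0]:2d}..{train_idx[-1]:2d}] "
          f"-> test [{test_idx[0]:2d}..{test_idx[-1]:2d}]")
```

Note this is the expanding variant; the `WalkForwardValidator` below implements the rolling (fixed-width) variant with optional purging.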
# Walk-Forward Validation Implementation

class WalkForwardValidator:
    """Walk-forward validation for time series ML."""
    
    def __init__(self, train_size: int = 252, test_size: int = 21,
                 step_size: int = 21, purge_size: int = 0):
        """
        Args:
            train_size: Number of days in training window
            test_size: Number of days in test window
            step_size: Number of days to step forward
            purge_size: Number of days to purge between train and test
        """
        self.train_size = train_size
        self.test_size = test_size
        self.step_size = step_size
        self.purge_size = purge_size
        
    def split(self, X: pd.DataFrame) -> List[Tuple[np.ndarray, np.ndarray]]:
        """Generate train/test splits."""
        n = len(X)
        splits = []
        
        start = 0
        while start + self.train_size + self.purge_size + self.test_size <= n:
            train_end = start + self.train_size
            test_start = train_end + self.purge_size
            test_end = test_start + self.test_size
            
            train_idx = np.arange(start, train_end)
            test_idx = np.arange(test_start, test_end)
            
            splits.append((train_idx, test_idx))
            start += self.step_size
            
        return splits
    
    def validate(self, model, X: pd.DataFrame, y: pd.Series,
                 scaler=None) -> Dict:
        """Run walk-forward validation."""
        splits = self.split(X)
        
        all_predictions = []
        all_actuals = []
        all_probas = []
        fold_metrics = []
        
        for i, (train_idx, test_idx) in enumerate(splits):
            X_train = X.iloc[train_idx]
            X_test = X.iloc[test_idx]
            y_train = y.iloc[train_idx]
            y_test = y.iloc[test_idx]
            
            # Scale if scaler provided
            if scaler:
                scaler_fold = scaler.__class__()
                X_train = scaler_fold.fit_transform(X_train)
                X_test = scaler_fold.transform(X_test)
            
            # Train and predict
            model_fold = model.__class__(**model.get_params())
            model_fold.fit(X_train, y_train)
            
            pred = model_fold.predict(X_test)
            proba = model_fold.predict_proba(X_test)
            
            all_predictions.extend(pred)
            all_actuals.extend(y_test.values)
            all_probas.extend(proba[:, 1])
            
            fold_metrics.append({
                'fold': i,
                'train_start': X.index[train_idx[0]],
                'train_end': X.index[train_idx[-1]],
                'test_start': X.index[test_idx[0]],
                'test_end': X.index[test_idx[-1]],
                'accuracy': accuracy_score(y_test, pred),
                'auc': roc_auc_score(y_test, proba[:, 1])
            })
        
        return {
            'predictions': np.array(all_predictions),
            'actuals': np.array(all_actuals),
            'probas': np.array(all_probas),
            'fold_metrics': pd.DataFrame(fold_metrics)
        }

# Run walk-forward validation
X_full = pd.concat([X_train, X_test])
y_full = pd.concat([y_train, y_test])

wf = WalkForwardValidator(train_size=200, test_size=20, step_size=20)
wf_results = wf.validate(model, X_full, y_full, scaler=StandardScaler())

print(f"Walk-Forward Validation:")
print(f"  Total folds: {len(wf_results['fold_metrics'])}")
print(f"  Total predictions: {len(wf_results['predictions'])}")
print(f"\n  Overall Accuracy: {accuracy_score(wf_results['actuals'], wf_results['predictions']):.4f}")
print(f"  Overall AUC: {roc_auc_score(wf_results['actuals'], wf_results['probas']):.4f}")
# Visualize walk-forward results

fold_df = wf_results['fold_metrics']

fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Accuracy across folds
axes[0].bar(fold_df['fold'], fold_df['accuracy'], color='steelblue', alpha=0.7)
axes[0].axhline(y=0.5, color='red', linestyle='--', label='Random')
axes[0].axhline(y=fold_df['accuracy'].mean(), color='green', linestyle='-',
                label=f'Mean: {fold_df["accuracy"].mean():.2%}')
axes[0].set_xlabel('Fold')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Walk-Forward Accuracy by Fold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# AUC across folds
axes[1].bar(fold_df['fold'], fold_df['auc'], color='forestgreen', alpha=0.7)
axes[1].axhline(y=0.5, color='red', linestyle='--', label='Random')
axes[1].axhline(y=fold_df['auc'].mean(), color='blue', linestyle='-',
                label=f'Mean: {fold_df["auc"].mean():.4f}')
axes[1].set_xlabel('Fold')
axes[1].set_ylabel('AUC')
axes[1].set_title('Walk-Forward AUC by Fold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nFold-level Metrics Summary:")
print(f"  Accuracy: {fold_df['accuracy'].mean():.4f} +/- {fold_df['accuracy'].std():.4f}")
print(f"  AUC: {fold_df['auc'].mean():.4f} +/- {fold_df['auc'].std():.4f}")
# Exercise 7.4: Complete Model Evaluator (Open-ended)
#
# Build a ModelEvaluator class that:
# - Calculates all classification metrics (accuracy, precision, recall, F1, AUC)
# - Calculates financial metrics (return, Sharpe, drawdown, win rate)
# - Supports walk-forward validation
# - Generates a comprehensive report
# - Creates visualization plots (ROC, confusion matrix, returns)
#
# Your implementation:
Solution 7.4
class ModelEvaluator:
    """Comprehensive model evaluation for trading ML."""

    def __init__(self, model, X_train, y_train, X_test, y_test, returns):
        self.model = model
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test
        self.returns = returns

        # Predictions
        self.y_pred = model.predict(X_test)
        self.y_proba = model.predict_proba(X_test)

    def get_classification_metrics(self) -> Dict:
        """Calculate all classification metrics."""
        return {
            'accuracy': accuracy_score(self.y_test, self.y_pred),
            'precision': precision_score(self.y_test, self.y_pred),
            'recall': recall_score(self.y_test, self.y_pred),
            'f1': f1_score(self.y_test, self.y_pred),
            'roc_auc': roc_auc_score(self.y_test, self.y_proba[:, 1])
        }

    def get_financial_metrics(self) -> Dict:
        """Calculate financial performance metrics."""
        test_returns = self.returns.loc[self.y_test.index]
        pred_series = pd.Series(self.y_pred, index=self.y_test.index)

        strategy_returns = pred_series.shift(1) * test_returns
        strategy_returns = strategy_returns.dropna()

        cum_return = (1 + strategy_returns).cumprod().iloc[-1] - 1
        std = strategy_returns.std()
        sharpe = np.sqrt(252) * strategy_returns.mean() / std if std > 0 else 0

        cum_rets = (1 + strategy_returns).cumprod()
        running_max = cum_rets.expanding().max()
        max_dd = ((cum_rets - running_max) / running_max).min()

        trades = strategy_returns[pred_series.shift(1) == 1]
        win_rate = (trades > 0).mean() if len(trades) > 0 else 0

        return {
            'total_return': cum_return,
            'sharpe_ratio': sharpe,
            'max_drawdown': max_dd,
            'win_rate': win_rate,
            'n_trades': len(trades)
        }

    def walk_forward_validate(self, train_size: int = 200,
                               test_size: int = 20) -> Dict:
        """Run walk-forward validation."""
        X_full = pd.concat([self.X_train, self.X_test])
        y_full = pd.concat([self.y_train, self.y_test])

        wf = WalkForwardValidator(train_size, test_size, test_size)
        return wf.validate(self.model, X_full, y_full, StandardScaler())

    def generate_report(self) -> pd.DataFrame:
        """Generate comprehensive report."""
        clf_metrics = self.get_classification_metrics()
        fin_metrics = self.get_financial_metrics()

        all_metrics = {**clf_metrics, **fin_metrics}

        return pd.DataFrame({
            'Metric': list(all_metrics.keys()),
            'Value': [f'{v:.4f}' if isinstance(v, float) else str(v)
                      for v in all_metrics.values()]
        })

    def plot_all(self):
        """Generate all visualization plots."""
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))

        # ROC Curve
        fpr, tpr, _ = roc_curve(self.y_test, self.y_proba[:, 1])
        axes[0, 0].plot(fpr, tpr, 'b-', lw=2)
        axes[0, 0].plot([0, 1], [0, 1], 'r--')
        axes[0, 0].set_title(f'ROC Curve (AUC={auc(fpr, tpr):.3f})')
        axes[0, 0].set_xlabel('FPR')
        axes[0, 0].set_ylabel('TPR')

        # Confusion Matrix
        cm = confusion_matrix(self.y_test, self.y_pred)
        sns.heatmap(cm, annot=True, fmt='d', ax=axes[0, 1], cmap='Blues')
        axes[0, 1].set_title('Confusion Matrix')

        # Cumulative Returns
        test_returns = self.returns.loc[self.y_test.index]
        pred_series = pd.Series(self.y_pred, index=self.y_test.index)
        strategy_rets = (pred_series.shift(1) * test_returns).fillna(0)

        cum_strategy = (1 + strategy_rets).cumprod()
        cum_bh = (1 + test_returns.fillna(0)).cumprod()

        axes[1, 0].plot(cum_strategy, label='Strategy')
        axes[1, 0].plot(cum_bh, label='Buy & Hold', alpha=0.7)
        axes[1, 0].set_title('Cumulative Returns')
        axes[1, 0].legend()

        # Metrics Summary
        report = self.generate_report()
        axes[1, 1].axis('off')
        table = axes[1, 1].table(
            cellText=report.values,
            colLabels=report.columns,
            loc='center',
            cellLoc='left'
        )
        table.auto_set_font_size(False)
        table.set_fontsize(10)
        axes[1, 1].set_title('Performance Metrics')

        plt.tight_layout()
        plt.show()
# Exercise 7.5: Threshold Optimizer (Open-ended)
#
# Build a ThresholdOptimizer class that:
# - Takes predicted probabilities and actual labels
# - Finds optimal threshold for different objectives:
#   - Maximize accuracy
#   - Maximize F1 score
#   - Maximize financial returns
#   - Balance precision and recall
# - Visualizes trade-offs at different thresholds
# - Returns recommendations with reasoning
#
# Your implementation:
Solution 7.5
class ThresholdOptimizer:
    """Optimize classification threshold for different objectives."""

    def __init__(self, y_true: pd.Series, y_proba: np.ndarray, returns: pd.Series = None):
        self.y_true = y_true
        self.y_proba = y_proba[:, 1] if y_proba.ndim > 1 else y_proba
        self.returns = returns
        self.thresholds = np.linspace(0.01, 0.99, 99)

    def _evaluate_threshold(self, threshold: float) -> Dict:
        """Evaluate metrics at given threshold."""
        y_pred = (self.y_proba >= threshold).astype(int)

        metrics = {
            'threshold': threshold,
            'accuracy': accuracy_score(self.y_true, y_pred),
            'precision': precision_score(self.y_true, y_pred, zero_division=0),
            'recall': recall_score(self.y_true, y_pred, zero_division=0),
            'f1': f1_score(self.y_true, y_pred, zero_division=0),
            'n_predictions': sum(y_pred)
        }

        if self.returns is not None:
            pred_series = pd.Series(y_pred, index=self.y_true.index)
            strat_rets = (pred_series.shift(1) * self.returns.loc[self.y_true.index]).dropna()
            metrics['total_return'] = (1 + strat_rets).cumprod().iloc[-1] - 1 if len(strat_rets) > 0 else 0

        return metrics

    def optimize(self, objective: str = 'f1') -> Dict:
        """Find optimal threshold for given objective."""
        results = [self._evaluate_threshold(t) for t in self.thresholds]
        df = pd.DataFrame(results)

        if objective in df.columns:
            best_idx = df[objective].idxmax()
            return {
                'optimal_threshold': df.loc[best_idx, 'threshold'],
                'optimal_value': df.loc[best_idx, objective],
                'metrics_at_optimal': df.loc[best_idx].to_dict(),
                'all_results': df
            }
        else:
            raise ValueError(f"Unknown objective: {objective}")

    def plot_tradeoffs(self):
        """Visualize metrics across thresholds."""
        results = [self._evaluate_threshold(t) for t in self.thresholds]
        df = pd.DataFrame(results)

        fig, axes = plt.subplots(2, 2, figsize=(14, 10))

        # Accuracy and F1
        axes[0, 0].plot(df['threshold'], df['accuracy'], label='Accuracy')
        axes[0, 0].plot(df['threshold'], df['f1'], label='F1')
        axes[0, 0].set_xlabel('Threshold')
        axes[0, 0].set_title('Accuracy and F1 vs Threshold')
        axes[0, 0].legend()
        axes[0, 0].grid(True, alpha=0.3)

        # Precision and Recall
        axes[0, 1].plot(df['threshold'], df['precision'], label='Precision')
        axes[0, 1].plot(df['threshold'], df['recall'], label='Recall')
        axes[0, 1].set_xlabel('Threshold')
        axes[0, 1].set_title('Precision and Recall vs Threshold')
        axes[0, 1].legend()
        axes[0, 1].grid(True, alpha=0.3)

        # Number of predictions
        axes[1, 0].plot(df['threshold'], df['n_predictions'])
        axes[1, 0].set_xlabel('Threshold')
        axes[1, 0].set_ylabel('Number of Positive Predictions')
        axes[1, 0].set_title('Trade Frequency vs Threshold')
        axes[1, 0].grid(True, alpha=0.3)

        # Returns if available
        if 'total_return' in df.columns:
            axes[1, 1].plot(df['threshold'], df['total_return'])
            axes[1, 1].set_xlabel('Threshold')
            axes[1, 1].set_ylabel('Total Return')
            axes[1, 1].set_title('Returns vs Threshold')
            axes[1, 1].grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()

    def recommend(self) -> str:
        """Provide threshold recommendation."""
        acc_opt = self.optimize('accuracy')
        f1_opt = self.optimize('f1')

        rec = f"""Threshold Recommendations:

1. For Maximum Accuracy: {acc_opt['optimal_threshold']:.3f}
   Accuracy: {acc_opt['optimal_value']:.4f}

2. For Maximum F1: {f1_opt['optimal_threshold']:.3f}
   F1: {f1_opt['optimal_value']:.4f}"""

        if self.returns is not None:
            ret_opt = self.optimize('total_return')
            rec += f"""

3. For Maximum Returns: {ret_opt['optimal_threshold']:.3f}
   Return: {ret_opt['optimal_value']:.4f}"""

        return rec
# Exercise 7.6: Model Comparison Dashboard (Open-ended)
#
# Build a ModelComparisonDashboard class that:
# - Takes multiple trained models
# - Compares them on all metrics (ML and financial)
# - Generates comparison tables and plots
# - Ranks models by different criteria
# - Provides a final recommendation
# - Exports results to a report
#
# Your implementation:
Solution 7.6
class ModelComparisonDashboard:
    """Compare multiple models comprehensively."""

    def __init__(self, models: Dict, X_train, y_train, X_test, y_test, returns):
        self.models = models
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test
        self.returns = returns
        self.results = {}

    def evaluate_all(self):
        """Evaluate all models."""
        for name, model in self.models.items():
            # Get predictions
            y_pred = model.predict(self.X_test)
            y_proba = model.predict_proba(self.X_test)

            # ML metrics
            ml_metrics = {
                'accuracy': accuracy_score(self.y_test, y_pred),
                'precision': precision_score(self.y_test, y_pred),
                'recall': recall_score(self.y_test, y_pred),
                'f1': f1_score(self.y_test, y_pred),
                'auc': roc_auc_score(self.y_test, y_proba[:, 1])
            }

            # Financial metrics
            test_returns = self.returns.loc[self.y_test.index]
            pred_series = pd.Series(y_pred, index=self.y_test.index)
            strat_rets = (pred_series.shift(1) * test_returns).dropna()

            fin_metrics = {
                'total_return': (1 + strat_rets).cumprod().iloc[-1] - 1,
                'sharpe': np.sqrt(252) * strat_rets.mean() / strat_rets.std() if strat_rets.std() > 0 else 0,
                'win_rate': (strat_rets[pred_series.shift(1) == 1] > 0).mean()
            }

            self.results[name] = {**ml_metrics, **fin_metrics}

        return self

    def get_comparison_table(self) -> pd.DataFrame:
        """Get comparison table."""
        return pd.DataFrame(self.results).T

    def rank_models(self, by: str = 'f1') -> pd.DataFrame:
        """Rank models by specific metric."""
        df = self.get_comparison_table()
        return df.sort_values(by, ascending=False)

    def plot_comparison(self):
        """Plot model comparison."""
        df = self.get_comparison_table()

        fig, axes = plt.subplots(2, 2, figsize=(14, 10))

        # Accuracy comparison
        df['accuracy'].plot(kind='bar', ax=axes[0, 0], color='steelblue')
        axes[0, 0].set_title('Accuracy')
        axes[0, 0].axhline(y=0.5, color='red', linestyle='--')

        # AUC comparison
        df['auc'].plot(kind='bar', ax=axes[0, 1], color='forestgreen')
        axes[0, 1].set_title('AUC')
        axes[0, 1].axhline(y=0.5, color='red', linestyle='--')

        # Returns comparison
        df['total_return'].plot(kind='bar', ax=axes[1, 0], color='darkorange')
        axes[1, 0].set_title('Total Return')
        axes[1, 0].axhline(y=0, color='gray', linestyle='--')

        # Sharpe comparison
        df['sharpe'].plot(kind='bar', ax=axes[1, 1], color='purple')
        axes[1, 1].set_title('Sharpe Ratio')
        axes[1, 1].axhline(y=0, color='gray', linestyle='--')

        for ax in axes.flat:
            ax.tick_params(axis='x', rotation=45)

        plt.tight_layout()
        plt.show()

    def recommend(self) -> str:
        """Recommend best model."""
        df = self.get_comparison_table()

        best_ml = df['f1'].idxmax()
        best_financial = df['total_return'].idxmax()

        # Score models on normalized metrics
        normalized = (df - df.min()) / (df.max() - df.min())
        combined_score = normalized.mean(axis=1)
        best_overall = combined_score.idxmax()

        return f"""Model Recommendations:

Best ML Performance: {best_ml}
  F1 Score: {df.loc[best_ml, 'f1']:.4f}
  AUC: {df.loc[best_ml, 'auc']:.4f}

Best Financial Performance: {best_financial}
  Total Return: {df.loc[best_financial, 'total_return']:.4f}
  Sharpe Ratio: {df.loc[best_financial, 'sharpe']:.2f}

Best Overall (Balanced): {best_overall}
  Combined Score: {combined_score[best_overall]:.4f}"""

    def export_report(self, filepath: str = 'model_comparison.csv'):
        """Export comparison to CSV."""
        df = self.get_comparison_table()
        df.to_csv(filepath)
        print(f"Report exported to {filepath}")

Module Project: Complete Model Evaluation Pipeline

Build a comprehensive evaluation system that combines all concepts.

class MLTradingEvaluator:
    """
    Complete evaluation pipeline for ML trading models.
    
    Combines classification metrics, financial metrics,
    walk-forward validation, and threshold optimization.
    """
    
    def __init__(self, model):
        self.model = model
        self.scaler = StandardScaler()
        self.evaluation_results = {}
        
    def prepare_data(self, df: pd.DataFrame, test_size: float = 0.2):
        """Prepare features and split data."""
        # Features
        features = pd.DataFrame(index=df.index)
        features['returns'] = df['Close'].pct_change()
        features['volatility'] = features['returns'].rolling(20).std()
        
        for p in [5, 10, 20]:
            features[f'momentum_{p}'] = df['Close'].pct_change(p)
        
        for p in [5, 20, 50]:
            ma = df['Close'].rolling(p).mean()
            features[f'dist_ma{p}'] = (df['Close'] - ma) / ma
        
        delta = df['Close'].diff()
        gain = (delta.where(delta > 0, 0)).rolling(14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
        features['rsi'] = 100 - (100 / (1 + gain / loss))
        features['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
        
        # Target and returns
        features['target'] = (features['returns'].shift(-1) > 0).astype(int)
        features['next_return'] = features['returns'].shift(-1)
        
        features = features.dropna()
        
        # Feature columns
        feature_cols = [c for c in features.columns if c not in ['target', 'next_return']]
        
        # Split
        split_idx = int(len(features) * (1 - test_size))
        
        self.X_train = features[feature_cols][:split_idx]
        self.X_test = features[feature_cols][split_idx:]
        self.y_train = features['target'][:split_idx]
        self.y_test = features['target'][split_idx:]
        self.returns = features['next_return']
        
        return self
    
    def train_and_predict(self):
        """Train model and get predictions."""
        X_train_scaled = self.scaler.fit_transform(self.X_train)
        X_test_scaled = self.scaler.transform(self.X_test)
        
        self.model.fit(X_train_scaled, self.y_train)
        
        self.y_pred = self.model.predict(X_test_scaled)
        self.y_proba = self.model.predict_proba(X_test_scaled)
        
        return self
    
    def evaluate_classification(self) -> Dict:
        """Calculate classification metrics."""
        metrics = {
            'accuracy': accuracy_score(self.y_test, self.y_pred),
            'precision': precision_score(self.y_test, self.y_pred),
            'recall': recall_score(self.y_test, self.y_pred),
            'f1': f1_score(self.y_test, self.y_pred),
            'roc_auc': roc_auc_score(self.y_test, self.y_proba[:, 1])
        }
        self.evaluation_results['classification'] = metrics
        return metrics
    
    def evaluate_financial(self) -> Dict:
        """Calculate financial metrics."""
        test_returns = self.returns.loc[self.y_test.index]
        pred_series = pd.Series(self.y_pred, index=self.y_test.index)
        
        strategy_returns = pred_series.shift(1) * test_returns
        strategy_returns = strategy_returns.dropna()
        
        cum_return = (1 + strategy_returns).cumprod().iloc[-1] - 1
        cum_bh = (1 + test_returns.loc[strategy_returns.index]).cumprod().iloc[-1] - 1
        
        std = strategy_returns.std()
        sharpe = np.sqrt(252) * strategy_returns.mean() / std if std > 0 else 0
        
        cum_rets = (1 + strategy_returns).cumprod()
        running_max = cum_rets.expanding().max()
        max_dd = ((cum_rets - running_max) / running_max).min()
        
        trades = strategy_returns[pred_series.shift(1) == 1]
        win_rate = (trades > 0).mean() if len(trades) > 0 else 0
        
        metrics = {
            'total_return': cum_return,
            'buy_hold_return': cum_bh,
            'outperformance': cum_return - cum_bh,
            'sharpe_ratio': sharpe,
            'max_drawdown': max_dd,
            'win_rate': win_rate,
            'n_trades': len(trades)
        }
        self.evaluation_results['financial'] = metrics
        return metrics
    
    def run_full_evaluation(self) -> pd.DataFrame:
        """Run complete evaluation."""
        clf_metrics = self.evaluate_classification()
        fin_metrics = self.evaluate_financial()
        
        all_metrics = []
        
        for name, value in clf_metrics.items():
            all_metrics.append({'Category': 'Classification', 'Metric': name, 'Value': f'{value:.4f}'})
        
        for name, value in fin_metrics.items():
            if isinstance(value, float):
                all_metrics.append({'Category': 'Financial', 'Metric': name, 'Value': f'{value:.4f}'})
            else:
                all_metrics.append({'Category': 'Financial', 'Metric': name, 'Value': str(value)})
        
        return pd.DataFrame(all_metrics)
    
    def plot_evaluation(self):
        """Create evaluation visualization."""
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        # ROC Curve
        fpr, tpr, _ = roc_curve(self.y_test, self.y_proba[:, 1])
        axes[0, 0].plot(fpr, tpr, 'b-', lw=2, label=f'AUC = {auc(fpr, tpr):.3f}')
        axes[0, 0].plot([0, 1], [0, 1], 'r--')
        axes[0, 0].set_title('ROC Curve')
        axes[0, 0].set_xlabel('False Positive Rate')
        axes[0, 0].set_ylabel('True Positive Rate')
        axes[0, 0].legend()
        axes[0, 0].grid(True, alpha=0.3)
        
        # Confusion Matrix
        cm = confusion_matrix(self.y_test, self.y_pred)
        sns.heatmap(cm, annot=True, fmt='d', ax=axes[0, 1], cmap='Blues',
                   xticklabels=['DOWN', 'UP'], yticklabels=['DOWN', 'UP'])
        axes[0, 1].set_title('Confusion Matrix')
        axes[0, 1].set_xlabel('Predicted')
        axes[0, 1].set_ylabel('Actual')
        
        # Cumulative Returns
        test_returns = self.returns.loc[self.y_test.index]
        pred_series = pd.Series(self.y_pred, index=self.y_test.index)
        strategy_rets = (pred_series.shift(1) * test_returns).fillna(0)
        
        cum_strategy = (1 + strategy_rets).cumprod()
        cum_bh = (1 + test_returns.fillna(0)).cumprod()
        
        axes[1, 0].plot(cum_strategy.index, cum_strategy, label='Strategy', lw=2)
        axes[1, 0].plot(cum_bh.index, cum_bh, label='Buy & Hold', lw=2, alpha=0.7)
        axes[1, 0].set_title('Cumulative Returns')
        axes[1, 0].set_ylabel('Cumulative Return')
        axes[1, 0].legend()
        axes[1, 0].grid(True, alpha=0.3)
        
        # Metrics Summary
        metrics_text = []
        for cat, mets in self.evaluation_results.items():
            metrics_text.append(f"\n{cat.upper()}:")
            for k, v in mets.items():
                if isinstance(v, float):
                    metrics_text.append(f"  {k}: {v:.4f}")
                else:
                    metrics_text.append(f"  {k}: {v}")
        
        axes[1, 1].text(0.1, 0.9, '\n'.join(metrics_text), transform=axes[1, 1].transAxes,
                       fontsize=11, verticalalignment='top', fontfamily='monospace')
        axes[1, 1].axis('off')
        axes[1, 1].set_title('Evaluation Summary')
        
        plt.tight_layout()
        plt.show()
# Run the complete evaluation pipeline

# Get data
ticker = yf.Ticker("SPY")
data = ticker.history(period="2y")

# Create evaluator
evaluator = MLTradingEvaluator(
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
)

# Run pipeline
evaluator.prepare_data(data)
evaluator.train_and_predict()

# Get full evaluation
report = evaluator.run_full_evaluation()
print("\nFull Evaluation Report:")
print(report.to_string(index=False))
# Visualize evaluation

evaluator.plot_evaluation()

Key Takeaways

  1. Accuracy alone is insufficient for evaluating trading models; always use precision, recall, F1, and AUC

  2. Confusion matrices reveal error patterns that summary metrics hide

  3. ROC curves and AUC provide threshold-independent model comparison

  4. Financial metrics (returns, Sharpe, drawdown) matter more than ML metrics for trading

  5. Walk-forward validation simulates real trading conditions and prevents overfitting

  6. Threshold optimization can significantly impact trading performance

  7. Compare multiple metrics across different objectives (ML vs financial) before selecting a model
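The sliding train/test mechanics behind takeaway 5 can be sketched as a minimal split generator. This is an illustration of the idea only, not the `WalkForwardValidator` class used earlier in the module:

```python
import numpy as np

def walk_forward_splits(n_samples, train_size, test_size):
    """Yield (train_idx, test_idx) windows that slide forward through time.

    Each fold trains only on observations strictly before its test window,
    so no future data leaks into training.
    """
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = np.arange(start, start + train_size)
        test_idx = np.arange(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += test_size  # advance by one test window per fold

for tr, te in walk_forward_splits(100, train_size=60, test_size=10):
    print(f"train {tr[0]:3d}-{tr[-1]:3d} | test {te[0]:3d}-{te[-1]:3d}")
```

With 100 samples, a 60-bar training window, and a 10-bar test window, this produces four non-overlapping test folds, each evaluated on data the model never saw during training.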


Next: Module 8 - Regression Models (Return prediction, volatility forecasting)

Module 8: Regression Models

Part 3: Advanced Techniques

Duration Exercises Prerequisites
~2.5 hours 6 Modules 1-7

Learning Objectives

By the end of this module, you will be able to:

- Apply regression models for return prediction
- Forecast volatility using various techniques
- Implement quantile regression for tail risk
- Use ensemble methods for regression
- Evaluate regression models with appropriate metrics

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# Regression models
from sklearn.linear_model import (
    LinearRegression, Ridge, Lasso, ElasticNet
)
from sklearn.ensemble import (
    RandomForestRegressor, GradientBoostingRegressor
)
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score
)

import yfinance as yf

print("Module 8: Regression Models")
print("=" * 40)
# Prepare regression data

def prepare_regression_data(symbol: str = "SPY", period: str = "2y") -> Tuple:
    """Prepare features and continuous target for regression."""
    
    ticker = yf.Ticker(symbol)
    df = ticker.history(period=period)
    
    # Features
    df['returns'] = df['Close'].pct_change()
    df['volatility'] = df['returns'].rolling(20).std()
    
    for p in [5, 10, 20]:
        df[f'momentum_{p}'] = df['Close'].pct_change(p)
    
    for p in [5, 20, 50]:
        ma = df['Close'].rolling(p).mean()
        df[f'dist_ma{p}'] = (df['Close'] - ma) / ma
    
    delta = df['Close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
    df['rsi'] = 100 - (100 / (1 + gain / loss))
    df['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
    
    # Continuous target: next day return
    df['target_return'] = df['returns'].shift(-1)
    
    # Alternative target: next 5-day return
    df['target_5d_return'] = df['Close'].pct_change(5).shift(-5)
    
    # Volatility target
    df['target_volatility'] = df['volatility'].shift(-1)
    
    df = df.dropna()
    
    features = ['volatility', 'momentum_5', 'momentum_10', 'momentum_20',
                'dist_ma5', 'dist_ma20', 'dist_ma50', 'rsi', 'volume_ratio']
    
    return df, features

# Load data
df, feature_cols = prepare_regression_data()
print(f"Data shape: {df.shape}")
print(f"Features: {feature_cols}")

Section 1: Linear Regression Models

Linear models are simple, interpretable, and often surprisingly effective for financial prediction.

# Linear Regression Concepts

linear_concepts = """
LINEAR REGRESSION FOR FINANCE
=============================

Basic Model:
------------
  y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

  Where:
  - y: Target (e.g., next day return)
  - xᵢ: Features (momentum, volatility, etc.)
  - βᵢ: Coefficients to learn
  - ε: Error term

Regularization Types:
---------------------
1. Ridge (L2): Shrinks coefficients, handles multicollinearity
   Loss = MSE + α * Σβᵢ²

2. Lasso (L1): Can zero out coefficients (feature selection)
   Loss = MSE + α * Σ|βᵢ|

3. ElasticNet: Combination of L1 and L2
   Loss = MSE + α * (r * Σ|βᵢ| + (1-r) * Σβᵢ²)

Advantages:
-----------
+ Interpretable coefficients
+ Fast to train and predict
+ Regularization prevents overfitting

Disadvantages:
--------------
- Assumes linear relationships
- May underfit complex patterns
- Sensitive to outliers

Financial Considerations:
-------------------------
- Returns are often nearly unpredictable (efficient markets)
- Small R² is normal (0.01-0.05 can be profitable)
- Coefficients show factor exposure
"""
print(linear_concepts)
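A quick synthetic demonstration makes the L1 vs L2 difference above concrete: Lasso drives irrelevant coefficients exactly to zero, while Ridge only shrinks them. The data here is invented for illustration (standard normal features, two of which actually drive the target), not the market data used below:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
# Only the first two features matter; the other three are pure noise
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.standard_normal(500)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Features zeroed by Lasso:", int((lasso.coef_ == 0).sum()))
```

The Lasso coefficients on the three noise features come out exactly zero (the soft-thresholding effect of the L1 penalty), which is why Lasso doubles as a feature-selection tool; Ridge keeps all five coefficients nonzero, just smaller.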
# Prepare data for regression

X = df[feature_cols]
y = df['target_return']

# Time series split
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training: {len(X_train)}, Test: {len(X_test)}")
print(f"\nTarget Statistics:")
print(f"  Mean: {y_train.mean():.6f}")
print(f"  Std: {y_train.std():.6f}")
# Basic Linear Regression

lr = LinearRegression()
lr.fit(X_train_scaled, y_train)

y_pred_lr = lr.predict(X_test_scaled)

# Metrics
mse = mean_squared_error(y_test, y_pred_lr)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred_lr)
r2 = r2_score(y_test, y_pred_lr)

print(f"Linear Regression Results:")
print(f"  RMSE: {rmse:.6f}")
print(f"  MAE:  {mae:.6f}")
print(f"  R²:   {r2:.4f}")

# Coefficients
coef_df = pd.DataFrame({
    'feature': feature_cols,
    'coefficient': lr.coef_
}).sort_values('coefficient', key=abs, ascending=False)

print(f"\nFeature Coefficients:")
for _, row in coef_df.iterrows():
    print(f"  {row['feature']:15s}: {row['coefficient']:+.6f}")
# Compare regularization methods

models = {
    'OLS': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.001),
    'ElasticNet': ElasticNet(alpha=0.001, l1_ratio=0.5)
}

results = []
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    results.append({
        'Model': name,
        'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
        'MAE': mean_absolute_error(y_test, y_pred),
        'R2': r2_score(y_test, y_pred)
    })

results_df = pd.DataFrame(results)
print("Linear Models Comparison:")
print(results_df.to_string(index=False))
# Exercise 8.1: Regularization Tuner (Guided)

def tune_ridge_alpha(X_train: np.ndarray, y_train: pd.Series,
                     alphas: List[float] = None,
                     cv_folds: int = 5) -> Dict:
    """
    Tune Ridge regression alpha using time series CV.
    
    Returns:
        Dictionary with best alpha and cross-validation results
    """
    if alphas is None:
        alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
    
    # TODO: Create time series cross-validator
    tscv = ______(n_splits=cv_folds)
    
    results = []
    best_score = float('-inf')
    best_alpha = None
    
    for alpha in alphas:
        # TODO: Create Ridge model with current alpha
        model = ______(alpha=______)
        
        # TODO: Get cross-validation scores (negative MSE)
        scores = ______(model, X_train, y_train, cv=tscv, 
                              scoring='neg_mean_squared_error')
        mean_score = scores.mean()
        
        results.append({
            'alpha': alpha,
            'mean_neg_mse': mean_score,
            'std_neg_mse': scores.std()
        })
        
        if mean_score > best_score:
            best_score = mean_score
            best_alpha = alpha
    
    return {
        'best_alpha': best_alpha,
        'best_score': best_score,
        'all_results': pd.DataFrame(results)
    }

# Test the function
# ridge_tuning = tune_ridge_alpha(X_train_scaled, y_train)
Solution 8.1
def tune_ridge_alpha(X_train: np.ndarray, y_train: pd.Series,
                     alphas: List[float] = None,
                     cv_folds: int = 5) -> Dict:
    """
    Tune Ridge regression alpha using time series CV.
    """
    if alphas is None:
        alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

    tscv = TimeSeriesSplit(n_splits=cv_folds)

    results = []
    best_score = float('-inf')
    best_alpha = None

    for alpha in alphas:
        model = Ridge(alpha=alpha)

        scores = cross_val_score(model, X_train, y_train, cv=tscv, 
                              scoring='neg_mean_squared_error')
        mean_score = scores.mean()

        results.append({
            'alpha': alpha,
            'mean_neg_mse': mean_score,
            'std_neg_mse': scores.std()
        })

        if mean_score > best_score:
            best_score = mean_score
            best_alpha = alpha

    return {
        'best_alpha': best_alpha,
        'best_score': best_score,
        'all_results': pd.DataFrame(results)
    }

Section 2: Tree-Based Regression

Random Forest and Gradient Boosting capture non-linear relationships between features and returns that linear models miss.

# Random Forest Regressor

rf_reg = RandomForestRegressor(
    n_estimators=100,
    max_depth=5,
    min_samples_leaf=20,
    random_state=42,
    n_jobs=-1
)

rf_reg.fit(X_train_scaled, y_train)
y_pred_rf = rf_reg.predict(X_test_scaled)

print(f"Random Forest Regressor Results:")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_rf)):.6f}")
print(f"  MAE:  {mean_absolute_error(y_test, y_pred_rf):.6f}")
print(f"  R²:   {r2_score(y_test, y_pred_rf):.4f}")

# Feature importance
importance_df = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_reg.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\nFeature Importance:")
for _, row in importance_df.iterrows():
    print(f"  {row['feature']:15s}: {row['importance']:.4f}")
# Gradient Boosting Regressor

gb_reg = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    min_samples_leaf=20,
    subsample=0.8,
    random_state=42
)

gb_reg.fit(X_train_scaled, y_train)
y_pred_gb = gb_reg.predict(X_test_scaled)

print(f"Gradient Boosting Regressor Results:")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_gb)):.6f}")
print(f"  MAE:  {mean_absolute_error(y_test, y_pred_gb):.6f}")
print(f"  R²:   {r2_score(y_test, y_pred_gb):.4f}")
# Exercise 8.2: Ensemble Regressor (Guided)

def create_ensemble_regressor(models: List, weights: List[float] = None) -> object:
    """
    Create a weighted ensemble of regression models.
    
    Returns:
        Object with fit, predict methods
    """
    class EnsembleRegressor:
        def __init__(self, models, weights):
            self.models = models
            # TODO: Set weights (equal if not provided)
            self.weights = weights if weights else [1/len(______)] * len(models)
            
        def fit(self, X, y):
            # TODO: Fit all models
            for model in self.______:
                model.______(X, y)
            return self
        
        def predict(self, X):
            # TODO: Weighted average of predictions
            predictions = np.zeros(len(X))
            for model, weight in zip(self.models, self.weights):
                predictions += ______ * model.______(X)
            return predictions
    
    return EnsembleRegressor(models, weights)

# Test the function
# ensemble = create_ensemble_regressor(
#     [Ridge(), RandomForestRegressor(n_estimators=50, max_depth=5)],
#     weights=[0.3, 0.7]
# )
Solution 8.2
def create_ensemble_regressor(models: List, weights: List[float] = None) -> object:
    """
    Create a weighted ensemble of regression models.
    """
    class EnsembleRegressor:
        def __init__(self, models, weights):
            self.models = models
            self.weights = weights if weights else [1/len(models)] * len(models)

        def fit(self, X, y):
            for model in self.models:
                model.fit(X, y)
            return self

        def predict(self, X):
            predictions = np.zeros(len(X))
            for model, weight in zip(self.models, self.weights):
                predictions += weight * model.predict(X)
            return predictions

    return EnsembleRegressor(models, weights)
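For production code, scikit-learn's built-in VotingRegressor performs the same weighted averaging as the hand-rolled ensemble above; here is a minimal check on synthetic data (all names local to this sketch):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

vote = VotingRegressor(
    estimators=[('ridge', Ridge()),
                ('rf', RandomForestRegressor(n_estimators=50, random_state=42))],
    weights=[0.3, 0.7],
).fit(X, y)

# VotingRegressor averages the fitted members' predictions with the given weights
manual = 0.3 * vote.estimators_[0].predict(X) + 0.7 * vote.estimators_[1].predict(X)
print(np.allclose(vote.predict(X), manual))
```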

Section 3: Volatility Forecasting

Predicting volatility is often easier and more useful than predicting returns.

# Volatility Forecasting Concepts

vol_concepts = """
VOLATILITY FORECASTING
======================

Why Volatility?
---------------
- More predictable than returns
- Clusters (high vol followed by high vol)
- Critical for risk management
- Used in options pricing

Common Measures:
----------------
1. Historical Volatility (annualized)
   σ = std(daily returns) * sqrt(252)

2. Realized Volatility
   RV = sqrt(Σ r_i²)  (sum of squared intraday returns)

3. Range-Based (Parkinson)
   σ = sqrt(mean(ln(High/Low)²) / (4*ln(2)))

ML Approaches:
--------------
- Predict next day/week volatility
- Use lagged volatility as key feature
- Often more successful than return prediction

Applications:
-------------
- Position sizing
- Risk budgeting
- Options trading
- VaR calculation
"""
print(vol_concepts)
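The measures listed above, as a runnable sketch on synthetic OHLC data (realized volatility needs intraday bars, so it is omitted; all names are local to this example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, n))))
high = close * np.exp(np.abs(rng.normal(0, 0.005, n)))   # High >= Close
low = close * np.exp(-np.abs(rng.normal(0, 0.005, n)))   # Low  <= Close

returns = close.pct_change()

# 1. Historical volatility, annualized (20-day window)
hist_vol = returns.rolling(20).std() * np.sqrt(252)

# 3. Parkinson range-based volatility (daily, 20-day window)
parkinson = np.sqrt((np.log(high / low) ** 2).rolling(20).mean() / (4 * np.log(2)))
```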
# Prepare volatility prediction data

def prepare_volatility_data(df: pd.DataFrame, vol_window: int = 20) -> Tuple:
    """Prepare features for volatility prediction."""
    
    vol_df = pd.DataFrame(index=df.index)
    
    # Current volatility (lagged features)
    returns = df['Close'].pct_change()
    vol_df['vol_20d'] = returns.rolling(vol_window).std()
    vol_df['vol_5d'] = returns.rolling(5).std()
    vol_df['vol_10d'] = returns.rolling(10).std()
    
    # Volatility ratios
    vol_df['vol_ratio_5_20'] = vol_df['vol_5d'] / vol_df['vol_20d']
    
    # Range-based volatility (Parkinson)
    vol_df['parkinson_vol'] = np.sqrt(
        (np.log(df['High'] / df['Low']) ** 2).rolling(vol_window).mean() / (4 * np.log(2))
    )
    
    # Absolute returns
    vol_df['abs_return_1d'] = returns.abs()
    vol_df['abs_return_5d'] = returns.abs().rolling(5).mean()
    
    # Volume features
    vol_df['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
    
    # Target: next day's volatility
    vol_df['target_vol'] = vol_df['vol_20d'].shift(-1)
    
    vol_df = vol_df.dropna()
    
    features = ['vol_20d', 'vol_5d', 'vol_10d', 'vol_ratio_5_20',
                'parkinson_vol', 'abs_return_1d', 'abs_return_5d', 'volume_ratio']
    
    return vol_df[features], vol_df['target_vol']

# Prepare volatility data
X_vol, y_vol = prepare_volatility_data(df)

# Split
split_idx = int(len(X_vol) * 0.8)
X_vol_train, X_vol_test = X_vol[:split_idx], X_vol[split_idx:]
y_vol_train, y_vol_test = y_vol[:split_idx], y_vol[split_idx:]

# Scale
vol_scaler = StandardScaler()
X_vol_train_scaled = vol_scaler.fit_transform(X_vol_train)
X_vol_test_scaled = vol_scaler.transform(X_vol_test)

print(f"Volatility prediction data: {len(X_vol)} samples")
# Train volatility forecasting models

vol_models = {
    'Ridge': Ridge(alpha=1.0),
    'RF': RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42),
    'GB': GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42)
}

print("Volatility Forecasting Results:")
print("=" * 50)

for name, model in vol_models.items():
    model.fit(X_vol_train_scaled, y_vol_train)
    y_pred = model.predict(X_vol_test_scaled)
    
    rmse = np.sqrt(mean_squared_error(y_vol_test, y_pred))
    mae = mean_absolute_error(y_vol_test, y_pred)
    r2 = r2_score(y_vol_test, y_pred)
    
    print(f"\n{name}:")
    print(f"  RMSE: {rmse:.6f}")
    print(f"  MAE:  {mae:.6f}")
    print(f"  R²:   {r2:.4f}")
# Visualize volatility predictions

# Use the best model (GB)
gb_vol = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42)
gb_vol.fit(X_vol_train_scaled, y_vol_train)
vol_pred = gb_vol.predict(X_vol_test_scaled)

# Plot
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Actual vs Predicted
axes[0].plot(y_vol_test.index, y_vol_test.values, label='Actual', alpha=0.7)
axes[0].plot(y_vol_test.index, vol_pred, label='Predicted', alpha=0.7)
axes[0].set_ylabel('Volatility')
axes[0].set_title('Volatility Forecast: Actual vs Predicted')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Scatter plot
axes[1].scatter(y_vol_test, vol_pred, alpha=0.5)
axes[1].plot([y_vol_test.min(), y_vol_test.max()],
             [y_vol_test.min(), y_vol_test.max()], 'r--', label='Perfect')
axes[1].set_xlabel('Actual Volatility')
axes[1].set_ylabel('Predicted Volatility')
axes[1].set_title('Prediction Scatter Plot')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
# Exercise 8.3: Volatility Forecaster (Guided)

class VolatilityForecaster:
    """
    Multi-horizon volatility forecasting system.
    """
    
    def __init__(self, horizons: List[int] = [1, 5, 20]):
        self.horizons = horizons
        self.models = {}
        self.scalers = {}
        
    def create_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create volatility features."""
        features = pd.DataFrame(index=df.index)
        returns = df['Close'].pct_change()
        
        # TODO: Add volatility features for multiple windows
        for window in [5, 10, 20, 60]:
            features[f'vol_{window}d'] = returns.rolling(______).______()
        
        # TODO: Add Parkinson volatility
        log_hl = np.log(df['High'] / df['______'])
        features['parkinson'] = np.sqrt((log_hl ** 2).rolling(20).mean() / (4 * np.log(2)))
        
        features['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
        
        return features.dropna()
    
    def fit(self, df: pd.DataFrame):
        """Fit models for each horizon."""
        features = self.create_features(df)
        returns = df['Close'].pct_change()
        
        for horizon in self.horizons:
            # Create target: forward volatility
            target = returns.rolling(horizon).std().shift(-horizon)
            
            # Align and clean
            aligned = pd.concat([features, target.rename('target')], axis=1).dropna()
            
            X = aligned.drop('target', axis=1)
            y = aligned['target']
            
            # Scale and fit
            self.scalers[horizon] = StandardScaler()
            X_scaled = self.scalers[horizon].fit_transform(X)
            
            self.models[horizon] = GradientBoostingRegressor(
                n_estimators=100, max_depth=3, random_state=42
            )
            self.models[horizon].fit(X_scaled, y)
        
        return self
    
    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        """Predict volatility for all horizons."""
        features = self.create_features(df)
        
        predictions = pd.DataFrame(index=features.index)
        for horizon in self.horizons:
            X_scaled = self.scalers[horizon].transform(features)
            predictions[f'vol_{horizon}d'] = self.models[horizon].predict(X_scaled)
        
        return predictions

# Test
# forecaster = VolatilityForecaster([1, 5, 20])
# forecaster.fit(df)
Solution 8.3
class VolatilityForecaster:
    def __init__(self, horizons: List[int] = [1, 5, 20]):
        self.horizons = horizons
        self.models = {}
        self.scalers = {}

    def create_features(self, df: pd.DataFrame) -> pd.DataFrame:
        features = pd.DataFrame(index=df.index)
        returns = df['Close'].pct_change()

        for window in [5, 10, 20, 60]:
            features[f'vol_{window}d'] = returns.rolling(window).std()

        log_hl = np.log(df['High'] / df['Low'])
        features['parkinson'] = np.sqrt((log_hl ** 2).rolling(20).mean() / (4 * np.log(2)))

        features['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()

        return features.dropna()

    def fit(self, df: pd.DataFrame):
        features = self.create_features(df)
        returns = df['Close'].pct_change()

        for horizon in self.horizons:
            target = returns.rolling(horizon).std().shift(-horizon)
            aligned = pd.concat([features, target.rename('target')], axis=1).dropna()

            X = aligned.drop('target', axis=1)
            y = aligned['target']

            self.scalers[horizon] = StandardScaler()
            X_scaled = self.scalers[horizon].fit_transform(X)

            self.models[horizon] = GradientBoostingRegressor(
                n_estimators=100, max_depth=3, random_state=42
            )
            self.models[horizon].fit(X_scaled, y)

        return self

    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        features = self.create_features(df)

        predictions = pd.DataFrame(index=features.index)
        for horizon in self.horizons:
            X_scaled = self.scalers[horizon].transform(features)
            predictions[f'vol_{horizon}d'] = self.models[horizon].predict(X_scaled)

        return predictions

Section 4: Quantile Regression

Predict different percentiles of the return distribution for tail risk analysis.

# Quantile Regression Concepts

quantile_concepts = """
QUANTILE REGRESSION
===================

What is Quantile Regression?
----------------------------
- Predict specific percentiles instead of mean
- Estimate conditional distribution of returns
- Essential for tail risk (VaR, CVaR)

Common Quantiles:
-----------------
- q=0.01: 1% worst case (VaR 99%)
- q=0.05: 5% worst case (VaR 95%)
- q=0.50: Median (robust to outliers)
- q=0.95: 5% best case
- q=0.99: 1% best case

Loss Function:
--------------
L(y, ŷ, q) = max(q*(y-ŷ), (q-1)*(y-ŷ))

The "pinball" loss is asymmetric: under-
predictions cost q per unit of error and
over-predictions cost (1-q), so the
minimizer is the q-th conditional quantile.

Applications:
-------------
- Value at Risk (VaR)
- Conditional VaR (Expected Shortfall)
- Prediction intervals
- Tail risk management
"""
print(quantile_concepts)
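The loss function above can be verified numerically: minimizing the pinball loss over a constant prediction recovers the empirical q-th quantile of the data (all names local to this sketch):

```python
import numpy as np

def pinball_loss(y, y_hat, q):
    """Pinball loss max(q*(y-ŷ), (q-1)*(y-ŷ)), averaged over samples."""
    diff = y - y_hat
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

rng = np.random.default_rng(0)
y = rng.normal(size=10_000)

grid = np.linspace(-3, 3, 601)
for q in (0.05, 0.50, 0.95):
    best = grid[np.argmin([pinball_loss(y, c, q) for c in grid])]
    # best should sit next to the empirical quantile np.quantile(y, q)
    print(q, round(best, 2), round(np.quantile(y, q), 2))
```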
# Quantile Regression with Gradient Boosting

from sklearn.ensemble import GradientBoostingRegressor

quantiles = [0.05, 0.25, 0.50, 0.75, 0.95]
quantile_models = {}

for q in quantiles:
    model = GradientBoostingRegressor(
        loss='quantile',
        alpha=q,
        n_estimators=100,
        max_depth=3,
        random_state=42
    )
    model.fit(X_train_scaled, y_train)
    quantile_models[q] = model

# Predict all quantiles
quantile_preds = pd.DataFrame(index=y_test.index)
for q, model in quantile_models.items():
    quantile_preds[f'q{int(q*100):02d}'] = model.predict(X_test_scaled)

print("Quantile Predictions (first 5 rows):")
print(quantile_preds.head())
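One caveat with independently fitted quantile models: predictions can occasionally "cross" (e.g. q05 above q50 on some rows). Sorting each row of the quantile matrix is a simple, standard rearrangement fix; a minimal sketch on toy values:

```python
import numpy as np
import pandas as pd

# Toy predictions with a crossing in the first row (q05 > q50)
preds = pd.DataFrame({'q05': [0.010, -0.020],
                      'q50': [-0.010, 0.000],
                      'q95': [0.030, 0.020]})

fixed = pd.DataFrame(np.sort(preds.values, axis=1),
                     index=preds.index, columns=preds.columns)
# Each row is now monotonically non-decreasing across quantiles
```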
# Visualize quantile predictions

fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Time series with prediction intervals
sample_size = 50
sample_idx = range(len(quantile_preds) - sample_size, len(quantile_preds))

axes[0].fill_between(range(sample_size), 
                     quantile_preds['q05'].iloc[sample_idx],
                     quantile_preds['q95'].iloc[sample_idx],
                     alpha=0.2, label='90% CI')
axes[0].fill_between(range(sample_size),
                     quantile_preds['q25'].iloc[sample_idx],
                     quantile_preds['q75'].iloc[sample_idx],
                     alpha=0.4, label='50% CI')
axes[0].plot(range(sample_size), quantile_preds['q50'].iloc[sample_idx],
            'b-', label='Median')
axes[0].plot(range(sample_size), y_test.iloc[sample_idx].values,
            'ro', markersize=4, label='Actual')
axes[0].set_xlabel('Day')
axes[0].set_ylabel('Return')
axes[0].set_title('Quantile Predictions with Confidence Intervals')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Distribution of predictions
for q in [0.05, 0.50, 0.95]:
    axes[1].hist(quantile_models[q].predict(X_test_scaled),
                bins=30, alpha=0.5, label=f'q{int(q*100):02d}')
axes[1].axvline(x=0, color='black', linestyle='--')
axes[1].set_xlabel('Predicted Return')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Quantile Predictions')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
# Exercise 8.4: VaR Predictor (Open-ended)
#
# Build a VaRPredictor class that:
# - Uses quantile regression to predict VaR at different confidence levels
# - Calculates Expected Shortfall (CVaR)
# - Provides backtesting for VaR violations
# - Visualizes VaR predictions vs actual returns
# - Reports coverage statistics
#
# Your implementation:
Solution 8.4
class VaRPredictor:
    """Value at Risk prediction using quantile regression."""

    def __init__(self, confidence_levels: List[float] = [0.95, 0.99]):
        self.confidence_levels = confidence_levels
        self.models = {}
        self.scaler = StandardScaler()

    def fit(self, X: pd.DataFrame, y: pd.Series):
        """Fit quantile models for each confidence level."""
        X_scaled = self.scaler.fit_transform(X)

        for conf in self.confidence_levels:
            alpha = 1 - conf  # VaR quantile
            self.models[conf] = GradientBoostingRegressor(
                loss='quantile',
                alpha=alpha,
                n_estimators=100,
                max_depth=3,
                random_state=42
            )
            self.models[conf].fit(X_scaled, y)

        return self

    def predict_var(self, X: pd.DataFrame) -> pd.DataFrame:
        """Predict VaR for all confidence levels."""
        X_scaled = self.scaler.transform(X)

        var_preds = pd.DataFrame(index=X.index)
        for conf in self.confidence_levels:
            var_preds[f'VaR_{int(conf*100)}'] = -self.models[conf].predict(X_scaled)

        return var_preds

    def backtest(self, X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
        """Backtest VaR predictions."""
        var_preds = self.predict_var(X)

        results = []
        for conf in self.confidence_levels:
            var_col = f'VaR_{int(conf*100)}'
            # Violation: actual return falls below the negative VaR threshold
            violations = (y.values < -var_preds[var_col].values).sum()
            expected = (1 - conf) * len(y)

            results.append({
                'confidence': conf,
                'violations': violations,
                'expected': expected,
                'violation_rate': violations / len(y),
                'expected_rate': 1 - conf
            })

        return pd.DataFrame(results)

    def calculate_cvar(self, X: pd.DataFrame, y: pd.Series,
                       confidence: float = 0.95) -> float:
        """Calculate Conditional VaR (Expected Shortfall)."""
        var_preds = self.predict_var(X)
        var_col = f'VaR_{int(confidence*100)}'

        # CVaR = average loss on days when the return breaches -VaR
        mask = y.values < -var_preds[var_col].values
        if mask.sum() > 0:
            cvar = -y[mask].mean()
        else:
            # No breaches in sample: fall back to the mean VaR estimate
            cvar = var_preds[var_col].mean()

        return cvar

    def plot_backtest(self, X: pd.DataFrame, y: pd.Series):
        """Visualize VaR backtest."""
        var_preds = self.predict_var(X)

        fig, axes = plt.subplots(len(self.confidence_levels), 1, 
                                 figsize=(14, 4*len(self.confidence_levels)))
        if len(self.confidence_levels) == 1:
            axes = [axes]

        for ax, conf in zip(axes, self.confidence_levels):
            var_col = f'VaR_{int(conf*100)}'

            ax.plot(y.index, y.values, 'b-', alpha=0.5, label='Returns')
            ax.plot(y.index, -var_preds[var_col].values, 'r-', label=var_col)

            # Mark violations (actual return below -VaR)
            violations = y.values < -var_preds[var_col].values
            ax.scatter(y.index[violations], y.values[violations], 
                      c='red', s=50, zorder=5, label='Violations')

            ax.set_title(f'VaR {int(conf*100)}% Backtest')
            ax.legend()
            ax.grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()
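The violation counting in backtest can be sanity-checked with a plain historical-simulation VaR on synthetic iid returns, where the out-of-sample violation rate should land near 1 - confidence (all names local to this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(0, 0.01, size=2000)

conf = 0.95
# In-sample 95% VaR: negative of the 5th percentile, reported as a positive loss
var = -np.quantile(returns[:1000], 1 - conf)

# A violation is a return below -VaR
violations = (returns[1000:] < -var).sum()
print(violations / 1000)   # expect roughly 0.05
```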
# Exercise 8.5: Multi-Horizon Return Predictor (Open-ended)
#
# Build a MultiHorizonPredictor class that:
# - Predicts returns at multiple horizons (1, 5, 10, 20 days)
# - Uses different models for each horizon
# - Provides uncertainty estimates
# - Evaluates prediction accuracy at each horizon
# - Generates a comprehensive prediction report
#
# Your implementation:
Solution 8.5
class MultiHorizonPredictor:
    """Predict returns at multiple horizons."""

    def __init__(self, horizons: List[int] = [1, 5, 10, 20]):
        self.horizons = horizons
        self.models = {}
        self.scalers = {}
        self.feature_names = None

    def create_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create features from price data."""
        features = pd.DataFrame(index=df.index)
        returns = df['Close'].pct_change()

        features['volatility'] = returns.rolling(20).std()
        for p in [5, 10, 20]:
            features[f'momentum_{p}'] = df['Close'].pct_change(p)
        for p in [5, 20, 50]:
            ma = df['Close'].rolling(p).mean()
            features[f'dist_ma{p}'] = (df['Close'] - ma) / ma

        delta = df['Close'].diff()
        gain = (delta.where(delta > 0, 0)).rolling(14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
        features['rsi'] = 100 - (100 / (1 + gain / loss))

        return features.dropna()

    def fit(self, df: pd.DataFrame):
        """Fit models for each horizon."""
        features = self.create_features(df)
        self.feature_names = features.columns.tolist()

        for horizon in self.horizons:
            # Create target
            target = df['Close'].pct_change(horizon).shift(-horizon)
            aligned = pd.concat([features, target.rename('target')], axis=1).dropna()

            X = aligned.drop('target', axis=1)
            y = aligned['target']

            self.scalers[horizon] = StandardScaler()
            X_scaled = self.scalers[horizon].fit_transform(X)

            # Use ensemble
            self.models[horizon] = {
                'point': GradientBoostingRegressor(
                    n_estimators=100, max_depth=3, random_state=42
                ),
                'lower': GradientBoostingRegressor(
                    loss='quantile', alpha=0.1, n_estimators=100, 
                    max_depth=3, random_state=42
                ),
                'upper': GradientBoostingRegressor(
                    loss='quantile', alpha=0.9, n_estimators=100,
                    max_depth=3, random_state=42
                )
            }

            for model in self.models[horizon].values():
                model.fit(X_scaled, y)

        return self

    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        """Predict returns with uncertainty."""
        features = self.create_features(df)

        predictions = pd.DataFrame(index=features.index)
        for horizon in self.horizons:
            X_scaled = self.scalers[horizon].transform(features)

            predictions[f'h{horizon}_point'] = self.models[horizon]['point'].predict(X_scaled)
            predictions[f'h{horizon}_lower'] = self.models[horizon]['lower'].predict(X_scaled)
            predictions[f'h{horizon}_upper'] = self.models[horizon]['upper'].predict(X_scaled)

        return predictions

    def evaluate(self, df: pd.DataFrame, test_frac: float = 0.2) -> pd.DataFrame:
        """Evaluate predictions at each horizon."""
        features = self.create_features(df)
        split_idx = int(len(features) * (1 - test_frac))

        results = []
        for horizon in self.horizons:
            target = df['Close'].pct_change(horizon).shift(-horizon)
            aligned = pd.concat([features, target.rename('target')], axis=1).dropna()

            test_features = aligned.drop('target', axis=1)[split_idx:]
            test_target = aligned['target'][split_idx:]

            X_scaled = self.scalers[horizon].transform(test_features)
            y_pred = self.models[horizon]['point'].predict(X_scaled)

            results.append({
                'horizon': horizon,
                'rmse': np.sqrt(mean_squared_error(test_target, y_pred)),
                'mae': mean_absolute_error(test_target, y_pred),
                'r2': r2_score(test_target, y_pred)
            })

        return pd.DataFrame(results)
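The target construction in fit, `pct_change(horizon).shift(-horizon)`, deserves a close look, since misaligned shifts are the classic source of lookahead bias. On a toy series the alignment can be checked by hand:

```python
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.0, 105.0, 110.0, 108.0])
horizon = 2

# target[i] = prices[i + horizon] / prices[i] - 1: the return over the NEXT
# `horizon` bars, so only future information ends up in the target column
target = prices.pct_change(horizon).shift(-horizon)
print(target.iloc[0])   # 101/100 - 1 = 0.01
```

The last `horizon` entries come out NaN, which is why the solutions above always `.dropna()` after aligning features and target.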
# Exercise 8.6: Complete Regression Evaluator (Open-ended)
#
# Build a RegressionEvaluator class that:
# - Compares multiple regression models
# - Uses walk-forward validation
# - Calculates regression metrics (MSE, MAE, R2)
# - Calculates direction accuracy (sign of prediction)
# - Computes information coefficient (IC)
# - Generates visualization of residuals and predictions
#
# Your implementation:
Solution 8.6
from scipy.stats import spearmanr

class RegressionEvaluator:
    """Comprehensive regression model evaluation."""

    def __init__(self, models: Dict):
        self.models = models
        self.results = {}
        self.predictions = {}

    def evaluate(self, X_train, y_train, X_test, y_test):
        """Evaluate all models."""
        for name, model in self.models.items():
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)

            self.predictions[name] = y_pred

            # Regression metrics
            mse = mean_squared_error(y_test, y_pred)
            mae = mean_absolute_error(y_test, y_pred)
            r2 = r2_score(y_test, y_pred)

            # Direction accuracy
            dir_acc = ((y_test > 0) == (y_pred > 0)).mean()

            # Information coefficient (rank correlation)
            ic, _ = spearmanr(y_test, y_pred)

            self.results[name] = {
                'mse': mse,
                'rmse': np.sqrt(mse),
                'mae': mae,
                'r2': r2,
                'direction_accuracy': dir_acc,
                'information_coefficient': ic
            }

        return self

    def walk_forward_evaluate(self, X, y, train_size: int = 200,
                               step_size: int = 20):
        """Walk-forward evaluation."""
        for name, model in self.models.items():
            all_preds = []
            all_actuals = []

            start = 0
            while start + train_size < len(X):
                train_end = start + train_size
                test_end = min(train_end + step_size, len(X))

                X_train = X.iloc[start:train_end]
                y_train_fold = y.iloc[start:train_end]
                X_test = X.iloc[train_end:test_end]
                y_test_fold = y.iloc[train_end:test_end]

                model_clone = model.__class__(**model.get_params())
                model_clone.fit(X_train, y_train_fold)

                all_preds.extend(model_clone.predict(X_test))
                all_actuals.extend(y_test_fold.values)

                start += step_size

            self.predictions[name] = np.array(all_preds)
            y_test_wf = np.array(all_actuals)

            mse = mean_squared_error(y_test_wf, all_preds)
            ic, _ = spearmanr(y_test_wf, all_preds)

            self.results[name] = {
                'rmse': np.sqrt(mse),
                'mae': mean_absolute_error(y_test_wf, all_preds),
                'r2': r2_score(y_test_wf, all_preds),
                'direction_accuracy': ((y_test_wf > 0) == (np.array(all_preds) > 0)).mean(),
                'information_coefficient': ic
            }

        return self

    def get_comparison_table(self) -> pd.DataFrame:
        """Get comparison DataFrame."""
        return pd.DataFrame(self.results).T

    def plot_results(self, y_test):
        """Plot predictions and residuals.

        Pass the actuals that the stored predictions correspond to (for
        walk_forward_evaluate, that is the out-of-sample tail of y).
        """
        n_models = len(self.models)
        fig, axes = plt.subplots(n_models, 2, figsize=(14, 4*n_models))
        if n_models == 1:
            axes = axes.reshape(1, -1)

        for i, (name, preds) in enumerate(self.predictions.items()):
            # Scatter plot
            axes[i, 0].scatter(y_test[:len(preds)], preds, alpha=0.5)
            axes[i, 0].plot([y_test.min(), y_test.max()],
                           [y_test.min(), y_test.max()], 'r--')
            axes[i, 0].set_xlabel('Actual')
            axes[i, 0].set_ylabel('Predicted')
            axes[i, 0].set_title(f'{name}: Actual vs Predicted')
            axes[i, 0].grid(True, alpha=0.3)

            # Residuals
            residuals = y_test[:len(preds)].values - preds
            axes[i, 1].hist(residuals, bins=30, alpha=0.7)
            axes[i, 1].axvline(x=0, color='red', linestyle='--')
            axes[i, 1].set_xlabel('Residual')
            axes[i, 1].set_ylabel('Frequency')
            axes[i, 1].set_title(f'{name}: Residual Distribution')
            axes[i, 1].grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()
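A note on the information coefficient used above: because Spearman's rank correlation depends only on orderings, the IC is unchanged by any strictly increasing transform of either series, which makes it robust to the fat tails and outliers common in return data. A quick check on synthetic data (all names local to this sketch):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
signal = rng.normal(size=500)
future_returns = 0.1 * signal + rng.normal(size=500)

ic_raw, _ = spearmanr(signal, future_returns)
# exp() is strictly increasing, so ranks (and hence the IC) are unchanged
ic_exp, _ = spearmanr(signal, np.exp(future_returns))
print(abs(ic_raw - ic_exp))   # ~0
```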

Module Project: Complete Return Prediction System

Build a comprehensive system for return and volatility prediction.

class ReturnPredictionSystem:
    """
    Complete system for return and volatility prediction.
    
    Features:
    - Multiple model types (linear, tree, ensemble)
    - Return and volatility forecasting
    - Quantile predictions for risk management
    - Walk-forward validation
    """
    
    def __init__(self):
        self.scaler = StandardScaler()
        self.return_models = {}
        self.vol_models = {}
        self.quantile_models = {}
        self.feature_names = None
        
    def create_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create comprehensive feature set."""
        features = pd.DataFrame(index=df.index)
        
        returns = df['Close'].pct_change()
        
        # Volatility features
        for w in [5, 10, 20]:
            features[f'vol_{w}d'] = returns.rolling(w).std()
        
        # Momentum features
        for p in [5, 10, 20]:
            features[f'mom_{p}d'] = df['Close'].pct_change(p)
        
        # MA distances
        for p in [5, 20, 50]:
            ma = df['Close'].rolling(p).mean()
            features[f'dist_ma{p}'] = (df['Close'] - ma) / ma
        
        # RSI
        delta = df['Close'].diff()
        gain = (delta.where(delta > 0, 0)).rolling(14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
        features['rsi'] = 100 - (100 / (1 + gain / loss))
        
        # Volume
        features['vol_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
        
        return features.dropna()
    
    def fit(self, df: pd.DataFrame, test_frac: float = 0.2):
        """Fit all prediction models."""
        features = self.create_features(df)
        self.feature_names = features.columns.tolist()
        
        returns = df['Close'].pct_change()
        volatility = returns.rolling(20).std()
        
        # Align targets
        target_return = returns.shift(-1)
        target_vol = volatility.shift(-1)
        
        combined = pd.concat([
            features, 
            target_return.rename('target_return'),
            target_vol.rename('target_vol')
        ], axis=1).dropna()
        
        X = combined[self.feature_names]
        y_return = combined['target_return']
        y_vol = combined['target_vol']
        
        # Split
        split_idx = int(len(X) * (1 - test_frac))
        self.X_train = X[:split_idx]
        self.X_test = X[split_idx:]
        self.y_return_train = y_return[:split_idx]
        self.y_return_test = y_return[split_idx:]
        self.y_vol_train = y_vol[:split_idx]
        self.y_vol_test = y_vol[split_idx:]
        
        # Scale
        self.X_train_scaled = self.scaler.fit_transform(self.X_train)
        self.X_test_scaled = self.scaler.transform(self.X_test)
        
        # Train return models
        self.return_models = {
            'Ridge': Ridge(alpha=1.0),
            'RF': RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42),
            'GB': GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42)
        }
        
        for name, model in self.return_models.items():
            model.fit(self.X_train_scaled, self.y_return_train)
        
        # Train volatility models
        self.vol_models = {
            'Ridge': Ridge(alpha=1.0),
            'GB': GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42)
        }
        
        for name, model in self.vol_models.items():
            model.fit(self.X_train_scaled, self.y_vol_train)
        
        # Train quantile models
        for q in [0.05, 0.50, 0.95]:
            self.quantile_models[q] = GradientBoostingRegressor(
                loss='quantile', alpha=q, n_estimators=100, 
                max_depth=3, random_state=42
            )
            self.quantile_models[q].fit(self.X_train_scaled, self.y_return_train)
        
        return self
    
    def evaluate(self) -> Dict:
        """Evaluate all models."""
        results = {'return_models': {}, 'vol_models': {}}
        
        for name, model in self.return_models.items():
            y_pred = model.predict(self.X_test_scaled)
            results['return_models'][name] = {
                'rmse': np.sqrt(mean_squared_error(self.y_return_test, y_pred)),
                'r2': r2_score(self.y_return_test, y_pred),
                'dir_acc': ((self.y_return_test > 0) == (y_pred > 0)).mean()
            }
        
        for name, model in self.vol_models.items():
            y_pred = model.predict(self.X_test_scaled)
            results['vol_models'][name] = {
                'rmse': np.sqrt(mean_squared_error(self.y_vol_test, y_pred)),
                'r2': r2_score(self.y_vol_test, y_pred)
            }
        
        return results
    
    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        """Generate all predictions."""
        features = self.create_features(df)
        X_scaled = self.scaler.transform(features)
        
        predictions = pd.DataFrame(index=features.index)
        
        # Ensemble return prediction
        return_preds = np.zeros(len(features))
        for model in self.return_models.values():
            return_preds += model.predict(X_scaled) / len(self.return_models)
        predictions['return_pred'] = return_preds
        
        # Volatility prediction
        predictions['vol_pred'] = self.vol_models['GB'].predict(X_scaled)
        
        # Quantile predictions
        for q, model in self.quantile_models.items():
            predictions[f'q{int(q*100):02d}'] = model.predict(X_scaled)
        
        return predictions
    
    def plot_results(self):
        """Visualize results."""
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        # Return predictions
        best_model = self.return_models['GB']
        y_pred = best_model.predict(self.X_test_scaled)
        
        axes[0, 0].scatter(self.y_return_test, y_pred, alpha=0.5)
        axes[0, 0].plot([self.y_return_test.min(), self.y_return_test.max()],
                       [self.y_return_test.min(), self.y_return_test.max()], 'r--')
        axes[0, 0].set_xlabel('Actual Return')
        axes[0, 0].set_ylabel('Predicted Return')
        axes[0, 0].set_title('Return Prediction')
        axes[0, 0].grid(True, alpha=0.3)
        
        # Volatility predictions
        vol_pred = self.vol_models['GB'].predict(self.X_test_scaled)
        axes[0, 1].scatter(self.y_vol_test, vol_pred, alpha=0.5)
        axes[0, 1].plot([self.y_vol_test.min(), self.y_vol_test.max()],
                       [self.y_vol_test.min(), self.y_vol_test.max()], 'r--')
        axes[0, 1].set_xlabel('Actual Volatility')
        axes[0, 1].set_ylabel('Predicted Volatility')
        axes[0, 1].set_title('Volatility Prediction')
        axes[0, 1].grid(True, alpha=0.3)
        
        # Quantile predictions
        q05 = self.quantile_models[0.05].predict(self.X_test_scaled)
        q95 = self.quantile_models[0.95].predict(self.X_test_scaled)
        
        axes[1, 0].fill_between(range(len(q05)), q05, q95, alpha=0.3, label='90% CI')
        axes[1, 0].plot(self.y_return_test.values, 'b-', alpha=0.7, label='Actual')
        axes[1, 0].set_xlabel('Day')
        axes[1, 0].set_ylabel('Return')
        axes[1, 0].set_title('Quantile Predictions')
        axes[1, 0].legend()
        axes[1, 0].grid(True, alpha=0.3)
        
        # Model comparison
        results = self.evaluate()
        models = list(results['return_models'].keys())
        r2_scores = [results['return_models'][m]['r2'] for m in models]
        dir_accs = [results['return_models'][m]['dir_acc'] for m in models]
        
        x = np.arange(len(models))
        width = 0.35
        axes[1, 1].bar(x - width/2, r2_scores, width, label='R²', alpha=0.8)
        axes[1, 1].bar(x + width/2, dir_accs, width, label='Dir Acc', alpha=0.8)
        axes[1, 1].set_xticks(x)
        axes[1, 1].set_xticklabels(models)
        axes[1, 1].set_title('Model Comparison')
        axes[1, 1].legend()
        axes[1, 1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
# Test the complete system

# Get data
ticker = yf.Ticker("SPY")
data = ticker.history(period="2y")

# Create and fit system
system = ReturnPredictionSystem()
system.fit(data)

# Evaluate
results = system.evaluate()

print("Return Prediction Results:")
print(pd.DataFrame(results['return_models']).T)

print("\nVolatility Prediction Results:")
print(pd.DataFrame(results['vol_models']).T)
# Visualize results

system.plot_results()

Key Takeaways

  1. Regularized linear models (Ridge, Lasso, ElasticNet) are simple baselines that often perform well

  2. Tree-based regressors (Random Forest, Gradient Boosting) capture non-linear patterns

  3. Volatility is more predictable than returns due to clustering effects

  4. Quantile regression provides uncertainty estimates and is essential for risk management

  5. Low R² is normal for return prediction; even an out-of-sample R² of 1-5% can support a profitable strategy

  6. Direction accuracy often matters more than exact return prediction for trading

  7. Walk-forward validation prevents overfitting and simulates real trading conditions
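Takeaway 7 can be sketched with scikit-learn's TimeSeriesSplit: every fold trains strictly on the past and tests on the window that follows, which is the property walk-forward validation relies on (a minimal illustration, not the full walk-forward optimizer covered in Module 12).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 100 days of data, 5 expanding-window folds
X = np.arange(100).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
    # Every training index precedes every test index: no lookahead
    assert train_idx.max() < test_idx.min()
    print(f"Fold {fold}: train [0..{train_idx.max()}], "
          f"test [{test_idx.min()}..{test_idx.max()}]")
```

Shuffled k-fold would leak future information into the training set, which is why it must never be used on financial time series.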


Next: Module 9 - Sentiment Analysis (Text processing, sentiment scoring, news signals)

Module 9: Sentiment Analysis

Part 3: Advanced Techniques

Duration Exercises Prerequisites
~2.5 hours 6 Modules 1-8

Learning Objectives

By the end of this module, you will be able to:

- Process and clean financial text data
- Apply sentiment scoring techniques to news and social media
- Use pre-trained models for financial sentiment
- Combine sentiment signals with price data
- Evaluate sentiment-based trading strategies

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
import re
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# NLP libraries
try:
    from textblob import TextBlob
    HAS_TEXTBLOB = True
except ImportError:
    HAS_TEXTBLOB = False
    print("TextBlob not installed. Install with: pip install textblob")

try:
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    nltk.download('vader_lexicon', quiet=True)
    nltk.download('punkt', quiet=True)
    nltk.download('stopwords', quiet=True)
    HAS_NLTK = True
except ImportError:
    HAS_NLTK = False
    print("NLTK not installed. Install with: pip install nltk")

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import accuracy_score, classification_report

import yfinance as yf

print("Module 9: Sentiment Analysis")
print("=" * 40)

Section 1: Text Processing Fundamentals

Before analyzing sentiment, we need to clean and preprocess text data.

# Sentiment Analysis Concepts

sentiment_concepts = """
SENTIMENT ANALYSIS FOR FINANCE
==============================

Why Sentiment Matters:
----------------------
- Market moves on news and perception
- Sentiment can lead price movements
- Social media provides real-time signals
- News impacts trading volumes

Data Sources:
-------------
1. News Articles
   - Financial news (Bloomberg, Reuters)
   - Press releases
   - Analyst reports

2. Social Media
   - Twitter/X (high frequency)
   - Reddit (r/wallstreetbets)
   - StockTwits

3. Company Filings
   - 10-K, 10-Q reports
   - Earnings call transcripts
   - Conference calls

Sentiment Scoring Methods:
--------------------------
1. Lexicon-Based
   - Dictionary of positive/negative words
   - VADER, Loughran-McDonald
   - Fast but may miss context

2. Machine Learning
   - Train classifier on labeled data
   - Can capture nuance
   - Needs training data

3. Deep Learning
   - Transformers (BERT, FinBERT)
   - State-of-the-art accuracy
   - Computationally expensive
"""
print(sentiment_concepts)
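The "fast but may miss context" caveat for lexicon methods is easy to demonstrate: a naive word-count scorer (toy word lists, purely illustrative) labels a negated sentence positive, which is why the FinancialSentimentLexicon defined later in this module tracks negation words.

```python
# Toy lexicon scorer: positive-word count minus negative-word count.
# The word sets here are illustrative, not a real financial lexicon.
positive = {'beat', 'surge', 'record'}
negative = {'miss', 'drop', 'crash'}

def naive_score(text: str) -> int:
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

print(naive_score("Earnings surge to a record high"))    # 2: correctly positive
print(naive_score("Company did not beat expectations"))  # 1: wrongly positive
```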
# Sample financial news headlines

sample_headlines = [
    "Apple reports record quarterly earnings, beats analyst expectations",
    "Tech stocks tumble amid interest rate concerns",
    "Federal Reserve signals potential rate cuts in 2024",
    "Tesla faces production challenges, stock drops 5%",
    "Microsoft's AI investments show strong returns",
    "Oil prices surge on Middle East tensions",
    "Retail sales disappoint, raising recession fears",
    "Goldman Sachs upgrades Amazon to buy rating",
    "Cryptocurrency market sees massive selloff",
    "Strong jobs report eases inflation concerns",
    "Boeing faces new safety investigation",
    "Nvidia stock hits all-time high on AI demand",
    "Bank earnings mixed amid economic uncertainty",
    "Housing market shows signs of cooling",
    "Disney streaming losses narrow, shares rally"
]

print(f"Sample Headlines ({len(sample_headlines)}):")
for i, headline in enumerate(sample_headlines[:5], 1):
    print(f"  {i}. {headline}")
# Text preprocessing functions

class TextPreprocessor:
    """Clean and preprocess financial text."""
    
    def __init__(self):
        # Common financial abbreviations to expand
        self.abbreviations = {
            'Q1': 'first quarter',
            'Q2': 'second quarter',
            'Q3': 'third quarter',
            'Q4': 'fourth quarter',
            'CEO': 'chief executive officer',
            'CFO': 'chief financial officer',
            'IPO': 'initial public offering',
            'EPS': 'earnings per share',
            'M&A': 'mergers and acquisitions',
            'YoY': 'year over year',
            'QoQ': 'quarter over quarter'
        }
        
        # Stopwords (common words to remove)
        self.stopwords = set(['the', 'a', 'an', 'is', 'are', 'was', 'were', 
                             'be', 'been', 'being', 'have', 'has', 'had',
                             'do', 'does', 'did', 'will', 'would', 'could',
                             'should', 'may', 'might', 'must', 'shall',
                             'and', 'or', 'but', 'if', 'then', 'else',
                             'when', 'where', 'why', 'how', 'what', 'which',
                             'who', 'whom', 'this', 'that', 'these', 'those',
                             'to', 'of', 'in', 'for', 'on', 'with', 'at',
                             'by', 'from', 'as', 'into', 'through', 'during'])
    
    def clean(self, text: str) -> str:
        """Basic text cleaning."""
        # Lowercase
        text = text.lower()
        
        # Remove URLs
        text = re.sub(r'http\S+|www\S+|https\S+', '', text)
        
        # Remove mentions and hashtags (for social media)
        text = re.sub(r'@\w+|#\w+', '', text)
        
        # Remove special characters but keep important punctuation
        text = re.sub(r'[^a-zA-Z0-9\s\.\!\?\%\$]', '', text)
        
        # Remove extra whitespace
        text = ' '.join(text.split())
        
        return text
    
    def expand_abbreviations(self, text: str) -> str:
        """Expand financial abbreviations."""
        for abbr, expansion in self.abbreviations.items():
            text = re.sub(r'\b' + abbr + r'\b', expansion, text, flags=re.IGNORECASE)
        return text
    
    def remove_stopwords(self, text: str) -> str:
        """Remove common stopwords."""
        words = text.split()
        words = [w for w in words if w.lower() not in self.stopwords]
        return ' '.join(words)
    
    def process(self, text: str, remove_stops: bool = False) -> str:
        """Full preprocessing pipeline."""
        text = self.clean(text)
        text = self.expand_abbreviations(text)
        if remove_stops:
            text = self.remove_stopwords(text)
        return text

# Test preprocessing
preprocessor = TextPreprocessor()

test_text = "Apple's Q4 EPS beats estimates! $AAPL @Bloomberg #stocks"
print(f"Original: {test_text}")
print(f"Cleaned:  {preprocessor.process(test_text)}")
# Exercise 9.1: Financial Text Cleaner (Guided)

def clean_financial_text(text: str, extract_tickers: bool = True) -> Dict:
    """
    Clean financial text and optionally extract stock tickers.
    
    Returns:
        Dictionary with cleaned text and extracted information
    """
    result = {
        'original': text,
        'cleaned': '',
        'tickers': [],
        'numbers': [],
        'percentages': []
    }
    
    # TODO: Extract stock tickers (pattern: $AAPL or just uppercase 2-5 letters)
    if extract_tickers:
        ticker_pattern = r'\$([A-Z]{1,5})|\b([A-Z]{2,5})\b'
        matches = re.______(ticker_pattern, text)
        result['tickers'] = list(set([m[0] or m[1] for m in matches if m[0] or m[1]]))
    
    # TODO: Extract percentages
    pct_pattern = r'([-+]?\d+\.?\d*)\s*%'
    result['percentages'] = [float(p) for p in re.______(pct_pattern, text)]
    
    # TODO: Extract numbers with $ sign
    money_pattern = r'\$([\d,]+\.?\d*)'
    result['numbers'] = re.______(money_pattern, text)
    
    # Clean text
    cleaned = text.lower()
    cleaned = re.sub(r'[^a-zA-Z\s]', ' ', cleaned)
    cleaned = ' '.join(cleaned.split())
    result['cleaned'] = cleaned
    
    return result

# Test
# result = clean_financial_text("$AAPL jumps 5.2% after beating Q4 estimates by $0.15")
Solution 9.1
def clean_financial_text(text: str, extract_tickers: bool = True) -> Dict:
    """
    Clean financial text and optionally extract stock tickers.
    """
    result = {
        'original': text,
        'cleaned': '',
        'tickers': [],
        'numbers': [],
        'percentages': []
    }

    if extract_tickers:
        # Heuristic: $-prefixed symbols plus bare 2-5 letter all-caps words;
        # the second alternative also matches acronyms like CEO or EPS
        ticker_pattern = r'\$([A-Z]{1,5})|\b([A-Z]{2,5})\b'
        matches = re.findall(ticker_pattern, text)
        result['tickers'] = list(set([m[0] or m[1] for m in matches if m[0] or m[1]]))

    pct_pattern = r'([-+]?\d+\.?\d*)\s*%'
    result['percentages'] = [float(p) for p in re.findall(pct_pattern, text)]

    money_pattern = r'\$([\d,]+\.?\d*)'
    result['numbers'] = re.findall(money_pattern, text)

    cleaned = text.lower()
    cleaned = re.sub(r'[^a-zA-Z\s]', ' ', cleaned)
    cleaned = ' '.join(cleaned.split())
    result['cleaned'] = cleaned

    return result
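The three regex extractions can be sanity-checked in isolation on the exercise's commented test string (a quick check of the solution's behavior, including the acronym caveat noted above):

```python
import re

text = "$AAPL jumps 5.2% after beating Q4 estimates by $0.15"

# $-prefixed tickers or bare all-caps words (2-5 letters)
matches = re.findall(r'\$([A-Z]{1,5})|\b([A-Z]{2,5})\b', text)
tickers = sorted({m[0] or m[1] for m in matches if m[0] or m[1]})
print(tickers)  # ['AAPL']

# Signed percentages
pcts = [float(p) for p in re.findall(r'([-+]?\d+\.?\d*)\s*%', text)]
print(pcts)     # [5.2]

# Dollar amounts (captures the digits after the $ sign)
money = re.findall(r'\$([\d,]+\.?\d*)', text)
print(money)    # ['0.15']
```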

Section 2: Lexicon-Based Sentiment

Using predefined sentiment dictionaries to score text.

# VADER Sentiment Analysis

if HAS_NLTK:
    sia = SentimentIntensityAnalyzer()
    
    print("VADER Sentiment Analysis:")
    print("=" * 60)
    
    for headline in sample_headlines[:5]:
        scores = sia.polarity_scores(headline)
        sentiment = 'Positive' if scores['compound'] > 0.05 else \
                   'Negative' if scores['compound'] < -0.05 else 'Neutral'
        print(f"\n'{headline[:50]}...'")
        print(f"  Compound: {scores['compound']:.3f} ({sentiment})")
        print(f"  Pos: {scores['pos']:.3f}, Neg: {scores['neg']:.3f}, Neu: {scores['neu']:.3f}")
else:
    print("NLTK not available")
# Custom Financial Sentiment Lexicon

class FinancialSentimentLexicon:
    """Custom sentiment scoring for financial text."""
    
    def __init__(self):
        # Financial-specific sentiment words
        self.positive_words = {
            # Strong positive
            'surge', 'soar', 'rally', 'boom', 'breakthrough', 'record',
            'beat', 'exceed', 'outperform', 'upgrade', 'bullish',
            # Moderate positive
            'gain', 'rise', 'grow', 'improve', 'strong', 'robust',
            'optimistic', 'profitable', 'positive', 'success',
            # Mild positive
            'stable', 'steady', 'maintain', 'recovery', 'rebound'
        }
        
        self.negative_words = {
            # Strong negative
            'crash', 'plunge', 'collapse', 'crisis', 'disaster',
            'bankrupt', 'default', 'miss', 'downgrade', 'bearish',
            # Moderate negative
            'fall', 'drop', 'decline', 'loss', 'weak', 'concern',
            'fear', 'risk', 'warning', 'struggle',
            # Mild negative
            'uncertainty', 'volatility', 'challenge', 'pressure', 'disappoint'
        }
        
        # Intensifiers and negations
        self.intensifiers = {'very', 'extremely', 'significantly', 'sharply', 'dramatically'}
        self.negations = {'not', 'no', 'never', 'neither', "n't", 'without', 'lack'}
        
        # Word weights
        self.word_weights = {
            # Strong words get higher weights
            'surge': 2.0, 'crash': -2.0, 'record': 1.5, 'crisis': -1.5,
            'beat': 1.2, 'miss': -1.2, 'upgrade': 1.5, 'downgrade': -1.5
        }
    
    def score(self, text: str) -> Dict:
        """Score sentiment of text."""
        text_lower = text.lower()
        words = text_lower.split()
        
        positive_count = 0
        negative_count = 0
        weighted_score = 0
        
        prev_word = ''
        for word in words:
            # Check for negation
            negated = prev_word in self.negations
            intensified = prev_word in self.intensifiers
            
            multiplier = 1.5 if intensified else 1.0
            if negated:
                multiplier *= -1
            
            if word in self.positive_words:
                weight = self.word_weights.get(word, 1.0)
                positive_count += 1
                weighted_score += weight * multiplier
            elif word in self.negative_words:
                weight = self.word_weights.get(word, -1.0)
                negative_count += 1
                weighted_score += weight * multiplier
            
            prev_word = word
        
        total_words = len(words)
        
        return {
            'positive_count': positive_count,
            'negative_count': negative_count,
            'weighted_score': weighted_score,
            'normalized_score': weighted_score / total_words if total_words > 0 else 0,
            'sentiment': 'positive' if weighted_score > 0.5 else 
                        'negative' if weighted_score < -0.5 else 'neutral'
        }

# Test
fin_lexicon = FinancialSentimentLexicon()

print("Financial Sentiment Lexicon:")
print("=" * 60)

for headline in sample_headlines[:5]:
    scores = fin_lexicon.score(headline)
    print(f"\n'{headline[:50]}...'")
    print(f"  Score: {scores['weighted_score']:.2f} ({scores['sentiment']})")
    print(f"  Pos words: {scores['positive_count']}, Neg words: {scores['negative_count']}")
# Exercise 9.2: Sentiment Scorer (Guided)

class SentimentScorer:
    """
    Combined sentiment scoring using multiple methods.
    """
    
    def __init__(self):
        self.vader = SentimentIntensityAnalyzer() if HAS_NLTK else None
        self.fin_lexicon = FinancialSentimentLexicon()
        
    def score_vader(self, text: str) -> float:
        """Get VADER compound score."""
        if self.vader:
            # TODO: Get VADER polarity scores and return compound
            scores = self.vader.______(text)
            return scores['______']
        return 0.0
    
    def score_financial(self, text: str) -> float:
        """Get financial lexicon score."""
        # TODO: Get financial lexicon scores and return normalized score
        scores = self.fin_lexicon.______(text)
        return scores['______']
    
    def score_combined(self, text: str, vader_weight: float = 0.5) -> Dict:
        """Combine VADER and financial lexicon scores."""
        vader_score = self.score_vader(text)
        fin_score = self.score_financial(text)
        
        # Normalize financial score to [-1, 1] range
        fin_normalized = np.clip(fin_score / 2, -1, 1)
        
        # Combined score
        combined = vader_weight * vader_score + (1 - vader_weight) * fin_normalized
        
        return {
            'vader': vader_score,
            'financial': fin_score,
            'combined': combined,
            'sentiment': 'positive' if combined > 0.1 else 'negative' if combined < -0.1 else 'neutral'
        }

# Test
# scorer = SentimentScorer()
# result = scorer.score_combined("Apple reports record earnings")
Solution 9.2
class SentimentScorer:
    def __init__(self):
        self.vader = SentimentIntensityAnalyzer() if HAS_NLTK else None
        self.fin_lexicon = FinancialSentimentLexicon()

    def score_vader(self, text: str) -> float:
        if self.vader:
            scores = self.vader.polarity_scores(text)
            return scores['compound']
        return 0.0

    def score_financial(self, text: str) -> float:
        scores = self.fin_lexicon.score(text)
        return scores['normalized_score']

    def score_combined(self, text: str, vader_weight: float = 0.5) -> Dict:
        vader_score = self.score_vader(text)
        fin_score = self.score_financial(text)

        fin_normalized = np.clip(fin_score / 2, -1, 1)
        combined = vader_weight * vader_score + (1 - vader_weight) * fin_normalized

        return {
            'vader': vader_score,
            'financial': fin_score,
            'combined': combined,
            'sentiment': 'positive' if combined > 0.1 else 'negative' if combined < -0.1 else 'neutral'
        }
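The weighting in score_combined is plain arithmetic, so it can be traced by hand with made-up scores (illustrative values, not output of either scorer):

```python
import numpy as np

# Hypothetical scores for one headline
vader_score = 0.6   # VADER compound, already in [-1, 1]
fin_score = 0.8     # financial lexicon score, roughly unbounded

fin_normalized = np.clip(fin_score / 2, -1, 1)  # squash into [-1, 1]
vader_weight = 0.5
combined = vader_weight * vader_score + (1 - vader_weight) * fin_normalized
print(round(combined, 3))  # 0.5 -> labeled 'positive' (threshold is 0.1)
```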

Section 3: News Sentiment Features

Creating tradeable features from news sentiment.

# Simulate news data with timestamps

def generate_simulated_news(n_days: int = 252) -> pd.DataFrame:
    """Generate simulated news data for demonstration."""
    np.random.seed(42)
    
    dates = pd.date_range(end=pd.Timestamp.today(), periods=n_days, freq='B')
    
    # Templates
    positive_templates = [
        "Stock surges on strong earnings report",
        "Analysts upgrade rating to buy",
        "Company announces record quarterly revenue",
        "Shares rally after positive guidance",
        "Investor optimism grows on deal news"
    ]
    
    negative_templates = [
        "Stock drops on disappointing results",
        "Analysts downgrade amid concerns",
        "Shares tumble on weak guidance",
        "Company faces regulatory challenges",
        "Investors worry about debt levels"
    ]
    
    neutral_templates = [
        "Company reports inline with expectations",
        "Stock trades sideways on mixed signals",
        "Analysts maintain hold rating",
        "Market awaits upcoming earnings release",
        "Trading volume remains steady"
    ]
    
    news_data = []
    for date in dates:
        # Generate 1-5 news items per day
        n_news = np.random.randint(1, 6)
        
        for _ in range(n_news):
            # Randomly select sentiment
            sentiment_type = np.random.choice(['positive', 'negative', 'neutral'], p=[0.35, 0.35, 0.3])
            
            if sentiment_type == 'positive':
                headline = np.random.choice(positive_templates)
            elif sentiment_type == 'negative':
                headline = np.random.choice(negative_templates)
            else:
                headline = np.random.choice(neutral_templates)
            
            news_data.append({
                'date': date,
                'headline': headline,
                'true_sentiment': sentiment_type
            })
    
    return pd.DataFrame(news_data)

# Generate news
news_df = generate_simulated_news()
print(f"Generated {len(news_df)} news items over {news_df['date'].nunique()} days")
print(f"\nSample:")
print(news_df.head(10).to_string(index=False))
# Create sentiment features from news

def create_sentiment_features(news_df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate news sentiment into daily features."""
    scorer = SentimentScorer()
    
    # Score each headline
    news_df['sentiment_score'] = news_df['headline'].apply(
        lambda x: scorer.score_combined(x)['combined']
    )
    
    # Aggregate by date
    daily_features = news_df.groupby('date').agg({
        'sentiment_score': ['mean', 'std', 'min', 'max', 'count'],
        'headline': 'count'
    }).reset_index()
    
    # Flatten column names
    daily_features.columns = [
        'date', 'sentiment_mean', 'sentiment_std', 'sentiment_min',
        'sentiment_max', 'sentiment_count', 'news_count'
    ]
    
    # Calculate additional features
    daily_features['sentiment_range'] = daily_features['sentiment_max'] - daily_features['sentiment_min']
    daily_features['sentiment_skew'] = daily_features['sentiment_mean'] - \
                                        (daily_features['sentiment_max'] + daily_features['sentiment_min']) / 2
    
    # Rolling features
    daily_features = daily_features.set_index('date').sort_index()
    daily_features['sentiment_ma3'] = daily_features['sentiment_mean'].rolling(3).mean()
    daily_features['sentiment_ma7'] = daily_features['sentiment_mean'].rolling(7).mean()
    daily_features['sentiment_momentum'] = daily_features['sentiment_mean'] - daily_features['sentiment_ma7']
    
    # std is NaN on single-news days; fill it so dropna() only removes
    # the rolling warm-up rows
    daily_features['sentiment_std'] = daily_features['sentiment_std'].fillna(0)
    return daily_features.dropna()

# Create features
sentiment_features = create_sentiment_features(news_df)
print(f"Daily sentiment features: {sentiment_features.shape}")
print(f"\nFeatures:")
print(sentiment_features.head())
# Visualize sentiment over time

fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Sentiment mean
axes[0].plot(sentiment_features.index, sentiment_features['sentiment_mean'], 
             'b-', alpha=0.7, label='Daily Mean')
axes[0].plot(sentiment_features.index, sentiment_features['sentiment_ma7'], 
             'r-', linewidth=2, label='7-Day MA')
axes[0].axhline(y=0, color='gray', linestyle='--')
axes[0].fill_between(sentiment_features.index, 
                     sentiment_features['sentiment_min'],
                     sentiment_features['sentiment_max'],
                     alpha=0.2, label='Min-Max Range')
axes[0].set_ylabel('Sentiment Score')
axes[0].set_title('Daily News Sentiment')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# News volume
axes[1].bar(sentiment_features.index, sentiment_features['news_count'], 
           alpha=0.7, color='steelblue')
axes[1].set_ylabel('News Count')
axes[1].set_title('Daily News Volume')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
# Exercise 9.3: Sentiment Feature Engineer (Guided)

def create_advanced_sentiment_features(news_df: pd.DataFrame, 
                                        lookback_days: List[int] = [3, 7, 14]) -> pd.DataFrame:
    """
    Create advanced sentiment features with multiple lookback periods.
    """
    scorer = SentimentScorer()
    
    # Score headlines
    news_df = news_df.copy()
    news_df['score'] = news_df['headline'].apply(
        lambda x: scorer.score_combined(x)['combined']
    )
    
    # TODO: Aggregate by date
    daily = news_df.groupby('date').agg({
        'score': ['mean', 'std', 'count']
    })
    daily.columns = ['sentiment', 'sentiment_std', 'news_count']
    daily = daily.______()
    
    # Fill missing dates
    full_dates = pd.date_range(daily.index.min(), daily.index.max(), freq='B')
    daily = daily.reindex(full_dates)
    daily['sentiment'] = daily['sentiment'].fillna(0)
    daily['news_count'] = daily['news_count'].fillna(0)
    
    # TODO: Create rolling features for each lookback period
    for days in lookback_days:
        daily[f'sentiment_ma{days}'] = daily['sentiment'].______(days).______()
        daily[f'sentiment_vol{days}'] = daily['sentiment'].______(days).______()
    
    # Sentiment momentum
    daily['sentiment_momentum'] = daily['sentiment'] - daily['sentiment_ma7']
    
    # Sentiment acceleration
    daily['sentiment_accel'] = daily['sentiment_momentum'].diff()
    
    return daily.dropna()

# Test
# advanced_features = create_advanced_sentiment_features(news_df)
Solution 9.3
def create_advanced_sentiment_features(news_df: pd.DataFrame, 
                                        lookback_days: List[int] = [3, 7, 14]) -> pd.DataFrame:
    scorer = SentimentScorer()

    news_df = news_df.copy()
    news_df['score'] = news_df['headline'].apply(
        lambda x: scorer.score_combined(x)['combined']
    )

    daily = news_df.groupby('date').agg({
        'score': ['mean', 'std', 'count']
    })
    daily.columns = ['sentiment', 'sentiment_std', 'news_count']
    daily = daily.sort_index()

    full_dates = pd.date_range(daily.index.min(), daily.index.max(), freq='B')
    daily = daily.reindex(full_dates)
    daily['sentiment'] = daily['sentiment'].fillna(0)
    daily['sentiment_std'] = daily['sentiment_std'].fillna(0)  # NaN on 0/1-news days
    daily['news_count'] = daily['news_count'].fillna(0)

    for days in lookback_days:
        daily[f'sentiment_ma{days}'] = daily['sentiment'].rolling(days).mean()
        daily[f'sentiment_vol{days}'] = daily['sentiment'].rolling(days).std()

    # Momentum relative to the 7-day average (assumes 7 is in lookback_days)
    daily['sentiment_momentum'] = daily['sentiment'] - daily['sentiment_ma7']
    daily['sentiment_accel'] = daily['sentiment_momentum'].diff()

    return daily.dropna()

Section 4: Sentiment Trading Signals

Combining sentiment with price data for trading signals.

# Combine sentiment with price data

def combine_sentiment_with_price(sentiment_df: pd.DataFrame, 
                                  symbol: str = "SPY") -> pd.DataFrame:
    """Combine sentiment features with price data."""
    # Get price data
    ticker = yf.Ticker(symbol)
    price_df = ticker.history(period="1y")
    
    # Calculate price features
    price_df['returns'] = price_df['Close'].pct_change()
    price_df['volatility'] = price_df['returns'].rolling(20).std()
    
    for p in [5, 10, 20]:
        price_df[f'momentum_{p}'] = price_df['Close'].pct_change(p)
    
    # Target: next day direction
    price_df['target'] = (price_df['returns'].shift(-1) > 0).astype(int)
    
    # Merge with sentiment: drop timezone info so the indices align
    # (tz_localize(None) raises TypeError on an already tz-naive index)
    if price_df.index.tz is not None:
        price_df.index = price_df.index.tz_localize(None)
    sentiment_df.index = pd.to_datetime(sentiment_df.index)
    if sentiment_df.index.tz is not None:
        sentiment_df.index = sentiment_df.index.tz_localize(None)
    
    combined = price_df.join(sentiment_df, how='left')
    
    # Fill missing sentiment with 0
    sentiment_cols = sentiment_df.columns
    combined[sentiment_cols] = combined[sentiment_cols].fillna(0)
    
    return combined.dropna()

# Combine data
combined_df = combine_sentiment_with_price(sentiment_features)
print(f"Combined data: {combined_df.shape}")
print(f"\nColumns: {combined_df.columns.tolist()}")
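The how='left' join above keeps every trading day and leaves NaN wherever no news arrived, and the fillna(0) that follows treats those no-news days as neutral. A tiny standalone example (dates are illustrative):

```python
import pandas as pd

prices = pd.DataFrame(
    {'Close': [100.0, 101.0, 102.0]},
    index=pd.to_datetime(['2024-01-02', '2024-01-03', '2024-01-04']))
sentiment = pd.DataFrame(
    {'sentiment_mean': [0.3, -0.1]},
    index=pd.to_datetime(['2024-01-02', '2024-01-04']))

combined = prices.join(sentiment, how='left')  # keeps every price row
# 2024-01-03 has no news row, so its sentiment becomes neutral (0.0)
combined['sentiment_mean'] = combined['sentiment_mean'].fillna(0)
print(combined)
```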
# Build sentiment-enhanced prediction model

def build_sentiment_model(combined_df: pd.DataFrame) -> Dict:
    """Build and evaluate sentiment-enhanced model."""
    # Define features
    price_features = ['volatility', 'momentum_5', 'momentum_10', 'momentum_20']
    sentiment_features_cols = ['sentiment_mean', 'sentiment_ma7', 'sentiment_momentum']
    
    # Available columns
    price_available = [f for f in price_features if f in combined_df.columns]
    sentiment_available = [f for f in sentiment_features_cols if f in combined_df.columns]
    
    # Models
    results = {}
    
    # Split
    split_idx = int(len(combined_df) * 0.8)
    
    scaler = StandardScaler()
    y = combined_df['target']  # defined up front so both models can use it
    
    # Model 1: Price only
    if price_available:
        X_price = combined_df[price_available]
        
        X_train, X_test = X_price[:split_idx], X_price[split_idx:]
        y_train, y_test = y[:split_idx], y[split_idx:]
        
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
        model.fit(X_train_scaled, y_train)
        
        results['price_only'] = {
            'accuracy': accuracy_score(y_test, model.predict(X_test_scaled)),
            'features': price_available
        }
    
    # Model 2: Price + Sentiment
    all_features = price_available + sentiment_available
    if all_features:
        X_all = combined_df[all_features]
        
        X_train, X_test = X_all[:split_idx], X_all[split_idx:]
        y_train, y_test = y[:split_idx], y[split_idx:]
        
        scaler_all = StandardScaler()
        X_train_scaled = scaler_all.fit_transform(X_train)
        X_test_scaled = scaler_all.transform(X_test)
        
        model_all = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
        model_all.fit(X_train_scaled, y_train)
        
        results['price_sentiment'] = {
            'accuracy': accuracy_score(y_test, model_all.predict(X_test_scaled)),
            'features': all_features,
            'feature_importance': dict(zip(all_features, model_all.feature_importances_))
        }
    
    return results

# Build models
model_results = build_sentiment_model(combined_df)

print("Model Comparison:")
print("=" * 40)
for name, result in model_results.items():
    print(f"\n{name}:")
    print(f"  Accuracy: {result['accuracy']:.2%}")
    print(f"  Features: {result['features']}")
    if 'feature_importance' in result:
        print(f"  Top Features:")
        for feat, imp in sorted(result['feature_importance'].items(), key=lambda x: -x[1])[:3]:
            print(f"    {feat}: {imp:.4f}")
# Exercise 9.4: Sentiment Trading System (Open-ended)
#
# Build a SentimentTradingSystem class that:
# - Processes news headlines to extract sentiment
# - Creates daily sentiment features
# - Combines with price data for signals
# - Generates buy/sell signals based on sentiment thresholds
# - Backtests the strategy and reports performance
#
# Your implementation:
Solution 9.4
class SentimentTradingSystem:
    """Trading system based on news sentiment."""

    def __init__(self, buy_threshold: float = 0.2, sell_threshold: float = -0.2):
        self.buy_threshold = buy_threshold
        self.sell_threshold = sell_threshold
        self.scorer = SentimentScorer()
        self.model = None

    def score_news(self, news_df: pd.DataFrame) -> pd.DataFrame:
        """Score all news headlines."""
        news_df = news_df.copy()
        news_df['sentiment'] = news_df['headline'].apply(
            lambda x: self.scorer.score_combined(x)['combined']
        )
        return news_df

    def aggregate_daily(self, news_df: pd.DataFrame) -> pd.DataFrame:
        """Aggregate to daily sentiment."""
        daily = news_df.groupby('date').agg({
            'sentiment': ['mean', 'std', 'count']
        })
        daily.columns = ['sentiment', 'sentiment_std', 'news_count']
        daily['sentiment_std'] = daily['sentiment_std'].fillna(0)  # NaN on single-news days
        daily['sentiment_ma5'] = daily['sentiment'].rolling(5).mean()
        daily['sentiment_momentum'] = daily['sentiment'] - daily['sentiment_ma5']
        return daily.dropna()

    def generate_signals(self, sentiment_df: pd.DataFrame) -> pd.DataFrame:
        """Generate trading signals."""
        signals = pd.DataFrame(index=sentiment_df.index)
        signals['sentiment'] = sentiment_df['sentiment']
        signals['signal'] = 0

        # Buy signal
        signals.loc[sentiment_df['sentiment'] > self.buy_threshold, 'signal'] = 1
        # Sell signal
        signals.loc[sentiment_df['sentiment'] < self.sell_threshold, 'signal'] = -1

        signals['position'] = signals['signal'].replace(0, np.nan).ffill().fillna(0)

        return signals

    def backtest(self, signals: pd.DataFrame, prices: pd.DataFrame) -> pd.DataFrame:
        """Backtest the sentiment strategy."""
        prices.index = prices.index.tz_localize(None)

        aligned = signals.join(prices[['Close']], how='inner')
        aligned['returns'] = aligned['Close'].pct_change()
        aligned['strategy_returns'] = aligned['position'].shift(1) * aligned['returns']

        aligned['cum_returns'] = (1 + aligned['returns']).cumprod()
        aligned['cum_strategy'] = (1 + aligned['strategy_returns'].fillna(0)).cumprod()

        return aligned

    def evaluate(self, backtest_results: pd.DataFrame) -> Dict:
        """Evaluate strategy performance."""
        strat_rets = backtest_results['strategy_returns'].dropna()

        return {
            'total_return': backtest_results['cum_strategy'].iloc[-1] - 1,
            'buy_hold_return': backtest_results['cum_returns'].iloc[-1] - 1,
            'sharpe': np.sqrt(252) * strat_rets.mean() / strat_rets.std(),
            'win_rate': (strat_rets > 0).mean(),
            'n_trades': (backtest_results['signal'] != 0).sum()
        }

    def plot_results(self, backtest_results: pd.DataFrame):
        """Visualize results."""
        fig, axes = plt.subplots(2, 1, figsize=(14, 8))

        axes[0].plot(backtest_results['cum_strategy'], label='Strategy')
        axes[0].plot(backtest_results['cum_returns'], label='Buy & Hold')
        axes[0].set_title('Sentiment Strategy Performance')
        axes[0].legend()
        axes[0].grid(True, alpha=0.3)

        axes[1].plot(backtest_results['sentiment'], alpha=0.7)
        axes[1].axhline(y=self.buy_threshold, color='green', linestyle='--')
        axes[1].axhline(y=self.sell_threshold, color='red', linestyle='--')
        axes[1].set_title('Sentiment with Thresholds')
        axes[1].grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()
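The heart of `generate_signals` is a threshold-plus-forward-fill pattern: enter long above the buy threshold, short below the sell threshold, and otherwise hold the previous position. A minimal, self-contained sketch of that logic on toy data (values chosen purely for illustration):

```python
import numpy as np
import pandas as pd

# Toy sentiment series and the same thresholds as the class defaults
sentiment = pd.Series([0.0, 0.3, 0.1, -0.25, -0.1, 0.4])
buy_threshold, sell_threshold = 0.2, -0.2

# Raw signal: +1 above buy threshold, -1 below sell threshold, else 0
signal = pd.Series(0, index=sentiment.index)
signal[sentiment > buy_threshold] = 1
signal[sentiment < sell_threshold] = -1

# Carry the last nonzero signal forward; flat (0) before any signal fires
position = signal.replace(0, np.nan).ffill().fillna(0)

print(position.tolist())  # [0.0, 1.0, 1.0, -1.0, -1.0, 1.0]
```

Note how day 2 (sentiment 0.1) stays long because of the forward-fill: neutral readings hold the existing position rather than exiting.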
# Exercise 9.5: News Impact Analyzer (Open-ended)
#
# Build a NewsImpactAnalyzer class that:
# - Measures price impact after news events
# - Categorizes news by sentiment intensity
# - Calculates average returns for each sentiment category
# - Identifies which types of news have the most impact
# - Provides statistical significance tests
#
# Your implementation:
Solution 9.5
from scipy import stats

class NewsImpactAnalyzer:
    """Analyze impact of news on prices."""

    def __init__(self, impact_windows: List[int] = [1, 3, 5]):
        self.impact_windows = impact_windows
        self.scorer = SentimentScorer()

    def score_and_categorize(self, news_df: pd.DataFrame) -> pd.DataFrame:
        """Score news and categorize by intensity."""
        news_df = news_df.copy()
        news_df['sentiment'] = news_df['headline'].apply(
            lambda x: self.scorer.score_combined(x)['combined']
        )

        # Categorize
        news_df['category'] = pd.cut(
            news_df['sentiment'],
            bins=[-np.inf, -0.3, -0.1, 0.1, 0.3, np.inf],
            labels=['Very Negative', 'Negative', 'Neutral', 'Positive', 'Very Positive']
        )

        return news_df

    def calculate_impact(self, news_df: pd.DataFrame, 
                         price_df: pd.DataFrame) -> pd.DataFrame:
        """Calculate price impact for each news item."""
        news_df = news_df.copy()
        price_df = price_df.copy()  # avoid mutating the caller's DataFrame
        price_df.index = price_df.index.tz_localize(None)

        for window in self.impact_windows:
            impacts = []
            for _, row in news_df.iterrows():
                date = row['date']
                if date in price_df.index:
                    try:
                        idx = price_df.index.get_loc(date)
                        if idx + window < len(price_df):
                            impact = (
                                price_df.iloc[idx + window]['Close'] / 
                                price_df.iloc[idx]['Close'] - 1
                            )
                        else:
                            impact = np.nan
                    except (KeyError, IndexError):
                        impact = np.nan
                else:
                    impact = np.nan
                impacts.append(impact)

            news_df[f'impact_{window}d'] = impacts

        return news_df

    def analyze_by_category(self, news_df: pd.DataFrame) -> pd.DataFrame:
        """Analyze impact by sentiment category."""
        results = []

        for category in news_df['category'].unique():
            cat_df = news_df[news_df['category'] == category]

            for window in self.impact_windows:
                impact_col = f'impact_{window}d'
                impacts = cat_df[impact_col].dropna()

                if len(impacts) > 0:
                    # T-test against zero
                    t_stat, p_value = stats.ttest_1samp(impacts, 0)

                    results.append({
                        'category': category,
                        'window': window,
                        'mean_impact': impacts.mean(),
                        'std_impact': impacts.std(),
                        'n_samples': len(impacts),
                        't_stat': t_stat,
                        'p_value': p_value,
                        'significant': p_value < 0.05
                    })

        return pd.DataFrame(results)

    def plot_impact(self, analysis_df: pd.DataFrame):
        """Visualize impact by category."""
        pivot = analysis_df.pivot(
            index='category', 
            columns='window', 
            values='mean_impact'
        )

        plt.figure(figsize=(10, 6))
        pivot.plot(kind='bar', ax=plt.gca())
        plt.title('Average Price Impact by Sentiment Category')
        plt.xlabel('Sentiment Category')
        plt.ylabel('Mean Return')
        plt.legend(title='Days After')
        plt.xticks(rotation=45)
        plt.axhline(y=0, color='black', linestyle='--')
        plt.tight_layout()
        plt.show()
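The one-sample t-test in `analyze_by_category` can be checked in isolation. The sketch below uses simulated impacts with a small positive drift; the 0.4% daily mean is an assumption for illustration, not a real estimate of news impact:

```python
import numpy as np
from scipy import stats

# Simulated 1-day impacts for a hypothetical "Very Positive" bucket
# (mean drift of 0.4%/day is purely illustrative)
rng = np.random.default_rng(0)
impacts = rng.normal(loc=0.004, scale=0.01, size=200)

# Same test as analyze_by_category: is the mean impact different from 0?
t_stat, p_value = stats.ttest_1samp(impacts, 0)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```

With 200 samples and this drift, the test comfortably rejects the null; with only a handful of news items per category, it often will not, which is why `n_samples` is reported alongside `p_value`.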
# Exercise 9.6: Complete Sentiment Pipeline (Open-ended)
#
# Build a SentimentPipeline class that:
# - Ingests raw news data
# - Cleans and preprocesses text
# - Scores sentiment using multiple methods
# - Creates tradeable features
# - Builds and evaluates ML models
# - Generates signals and backtests
# - Produces a comprehensive report
#
# Your implementation:
Solution 9.6
class SentimentPipeline:
    """End-to-end sentiment analysis pipeline."""

    def __init__(self):
        self.preprocessor = TextPreprocessor()
        self.scorer = SentimentScorer()
        self.model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
        self.scaler = StandardScaler()
        self.results = {}

    def preprocess(self, news_df: pd.DataFrame) -> pd.DataFrame:
        """Clean and preprocess news."""
        news_df = news_df.copy()
        news_df['cleaned'] = news_df['headline'].apply(self.preprocessor.process)
        return news_df

    def score(self, news_df: pd.DataFrame) -> pd.DataFrame:
        """Score sentiment."""
        news_df['sentiment'] = news_df['headline'].apply(
            lambda x: self.scorer.score_combined(x)['combined']
        )
        return news_df

    def create_features(self, news_df: pd.DataFrame) -> pd.DataFrame:
        """Create daily features."""
        daily = news_df.groupby('date').agg({
            'sentiment': ['mean', 'std', 'min', 'max', 'count']
        })
        daily.columns = ['sent_mean', 'sent_std', 'sent_min', 'sent_max', 'news_count']

        # Rolling features
        for w in [3, 7, 14]:
            daily[f'sent_ma{w}'] = daily['sent_mean'].rolling(w).mean()
            daily[f'sent_vol{w}'] = daily['sent_mean'].rolling(w).std()

        daily['sent_momentum'] = daily['sent_mean'] - daily['sent_ma7']

        return daily.dropna()

    def combine_with_price(self, features: pd.DataFrame, 
                           symbol: str = 'SPY') -> pd.DataFrame:
        """Combine with price data."""
        ticker = yf.Ticker(symbol)
        prices = ticker.history(period='1y')

        prices['returns'] = prices['Close'].pct_change()
        prices['volatility'] = prices['returns'].rolling(20).std()
        prices['target'] = (prices['returns'].shift(-1) > 0).astype(int)

        prices.index = prices.index.tz_localize(None)
        combined = prices.join(features, how='left').dropna()

        return combined

    def train_model(self, combined: pd.DataFrame, test_frac: float = 0.2):
        """Train prediction model."""
        feature_cols = ['volatility', 'sent_mean', 'sent_ma7', 'sent_momentum']
        available = [c for c in feature_cols if c in combined.columns]

        X = combined[available]
        y = combined['target']

        split_idx = int(len(X) * (1 - test_frac))
        X_train, X_test = X[:split_idx], X[split_idx:]
        y_train, y_test = y[:split_idx], y[split_idx:]

        X_train_scaled = self.scaler.fit_transform(X_train)
        X_test_scaled = self.scaler.transform(X_test)

        self.model.fit(X_train_scaled, y_train)

        self.results['train_accuracy'] = self.model.score(X_train_scaled, y_train)
        self.results['test_accuracy'] = self.model.score(X_test_scaled, y_test)
        self.results['feature_importance'] = dict(zip(available, self.model.feature_importances_))

        return self

    def run_pipeline(self, news_df: pd.DataFrame, symbol: str = 'SPY') -> Dict:
        """Run full pipeline."""
        print("Preprocessing...")
        news_df = self.preprocess(news_df)

        print("Scoring sentiment...")
        news_df = self.score(news_df)

        print("Creating features...")
        features = self.create_features(news_df)

        print("Combining with prices...")
        combined = self.combine_with_price(features, symbol)

        print("Training model...")
        self.train_model(combined)

        print("\nPipeline Complete!")
        return self.results

    def generate_report(self) -> str:
        """Generate text report."""
        report = f"""Sentiment Pipeline Report
========================

Model Performance:
  Train Accuracy: {self.results.get('train_accuracy', 0):.2%}
  Test Accuracy: {self.results.get('test_accuracy', 0):.2%}

Feature Importance:
"""
        for feat, imp in sorted(
            self.results.get('feature_importance', {}).items(),
            key=lambda x: -x[1]
        ):
            report += f"  {feat}: {imp:.4f}\n"

        return report
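A note on the target construction used throughout this module: `shift(-1)` aligns tomorrow's return with today's features, so each row is labeled with what happens next. A toy check (be aware that the final row, which has no next day, gets a spurious 0 label because `NaN > 0` is False; in production you would drop it explicitly):

```python
import pandas as pd

# Label each day 1 if the NEXT day's return is positive
returns = pd.Series([0.01, -0.02, 0.005, 0.03])
target = (returns.shift(-1) > 0).astype(int)
print(target.tolist())  # [0, 1, 1, 0] -- last 0 is an artifact, not a label
```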

Module Project: News Sentiment Trading System

Build a complete system that combines news sentiment analysis with trading signals.

class NewsSentimentTradingSystem:
    """
    Complete news sentiment trading system.
    
    Features:
    - Multi-method sentiment scoring
    - Feature engineering for sentiment
    - ML model for signal generation
    - Backtesting and performance analysis
    """
    
    def __init__(self):
        self.preprocessor = TextPreprocessor()
        self.scorer = SentimentScorer()
        self.model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
        self.scaler = StandardScaler()
        
    def process_news(self, news_df: pd.DataFrame) -> pd.DataFrame:
        """Process raw news data."""
        news_df = news_df.copy()
        
        # Clean text
        news_df['cleaned'] = news_df['headline'].apply(self.preprocessor.process)
        
        # Score sentiment
        news_df['sentiment'] = news_df['headline'].apply(
            lambda x: self.scorer.score_combined(x)['combined']
        )
        
        return news_df
    
    def create_daily_features(self, news_df: pd.DataFrame) -> pd.DataFrame:
        """Aggregate news to daily features."""
        daily = news_df.groupby('date').agg({
            'sentiment': ['mean', 'std', 'min', 'max', 'count']
        })
        daily.columns = ['sent_mean', 'sent_std', 'sent_min', 'sent_max', 'news_count']
        daily = daily.sort_index()
        
        # Rolling features
        for window in [3, 7, 14]:
            daily[f'sent_ma{window}'] = daily['sent_mean'].rolling(window).mean()
        
        # Momentum and volatility
        daily['sent_momentum'] = daily['sent_mean'] - daily['sent_ma7']
        daily['sent_vol7'] = daily['sent_mean'].rolling(7).std()
        
        return daily.dropna()
    
    def prepare_training_data(self, sentiment_features: pd.DataFrame, 
                              symbol: str = "SPY") -> pd.DataFrame:
        """Prepare training data with price and sentiment."""
        # Get price data
        ticker = yf.Ticker(symbol)
        prices = ticker.history(period="1y")
        
        # Price features
        prices['returns'] = prices['Close'].pct_change()
        prices['volatility'] = prices['returns'].rolling(20).std()
        prices['momentum_5'] = prices['Close'].pct_change(5)
        prices['momentum_20'] = prices['Close'].pct_change(20)
        
        # Target
        prices['target'] = (prices['returns'].shift(-1) > 0).astype(int)
        
        # Merge
        prices.index = prices.index.tz_localize(None)
        combined = prices.join(sentiment_features, how='left')
        
        # Fill missing sentiment
        sent_cols = sentiment_features.columns
        combined[sent_cols] = combined[sent_cols].fillna(0)
        
        return combined.dropna()
    
    def fit(self, combined_df: pd.DataFrame, test_frac: float = 0.2):
        """Train the trading model."""
        # Features
        feature_cols = ['volatility', 'momentum_5', 'momentum_20',
                       'sent_mean', 'sent_ma7', 'sent_momentum', 'sent_vol7']
        available = [c for c in feature_cols if c in combined_df.columns]
        
        X = combined_df[available]
        y = combined_df['target']
        
        # Split
        split_idx = int(len(X) * (1 - test_frac))
        self.X_train = X[:split_idx]
        self.X_test = X[split_idx:]
        self.y_train = y[:split_idx]
        self.y_test = y[split_idx:]
        self.test_returns = combined_df['returns'][split_idx:]
        
        # Scale and train
        X_train_scaled = self.scaler.fit_transform(self.X_train)
        self.model.fit(X_train_scaled, self.y_train)
        
        self.feature_names = available
        
        return self
    
    def evaluate(self) -> Dict:
        """Evaluate model performance."""
        X_train_scaled = self.scaler.transform(self.X_train)
        X_test_scaled = self.scaler.transform(self.X_test)
        
        y_pred = self.model.predict(X_test_scaled)
        
        # Classification metrics
        train_acc = self.model.score(X_train_scaled, self.y_train)
        test_acc = self.model.score(X_test_scaled, self.y_test)
        
        # Financial metrics
        pred_series = pd.Series(y_pred, index=self.y_test.index)
        strategy_returns = pred_series.shift(1) * self.test_returns
        strategy_returns = strategy_returns.dropna()
        
        total_return = (1 + strategy_returns).cumprod().iloc[-1] - 1
        bh_return = (1 + self.test_returns.loc[strategy_returns.index]).cumprod().iloc[-1] - 1
        sharpe = np.sqrt(252) * strategy_returns.mean() / strategy_returns.std()
        
        return {
            'train_accuracy': train_acc,
            'test_accuracy': test_acc,
            'total_return': total_return,
            'buy_hold_return': bh_return,
            'outperformance': total_return - bh_return,
            'sharpe_ratio': sharpe,
            'feature_importance': dict(zip(self.feature_names, self.model.feature_importances_))
        }
    
    def plot_results(self):
        """Visualize results."""
        X_test_scaled = self.scaler.transform(self.X_test)
        y_pred = self.model.predict(X_test_scaled)
        
        pred_series = pd.Series(y_pred, index=self.y_test.index)
        strategy_returns = pred_series.shift(1) * self.test_returns
        
        cum_strategy = (1 + strategy_returns.fillna(0)).cumprod()
        cum_bh = (1 + self.test_returns.fillna(0)).cumprod()
        
        fig, axes = plt.subplots(2, 1, figsize=(14, 10))
        
        # Cumulative returns
        axes[0].plot(cum_strategy.index, cum_strategy, label='Strategy', linewidth=2)
        axes[0].plot(cum_bh.index, cum_bh, label='Buy & Hold', linewidth=2, alpha=0.7)
        axes[0].set_ylabel('Cumulative Return')
        axes[0].set_title('Sentiment Trading Strategy Performance')
        axes[0].legend()
        axes[0].grid(True, alpha=0.3)
        
        # Feature importance
        importance = pd.Series(
            self.model.feature_importances_,
            index=self.feature_names
        ).sort_values()
        
        axes[1].barh(importance.index, importance.values, color='steelblue')
        axes[1].set_xlabel('Feature Importance')
        axes[1].set_title('Model Feature Importance')
        axes[1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
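As a quick sanity check of the annualized Sharpe formula used in `evaluate` — `sqrt(252) * mean / std` of daily strategy returns, with the risk-free rate assumed zero — on made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up daily strategy returns, purely illustrative
daily = pd.Series([0.01, -0.005, 0.007, 0.002, -0.001])

# Scale the daily mean/std ratio by sqrt(252) trading days
sharpe = np.sqrt(252) * daily.mean() / daily.std()
print(f"Annualized Sharpe: {sharpe:.2f}")
```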
# Run the complete system

# Generate simulated news
news_data = generate_simulated_news(252)

# Create system
system = NewsSentimentTradingSystem()

# Process news
print("Processing news...")
processed_news = system.process_news(news_data)

# Create features
print("Creating features...")
sentiment_features = system.create_daily_features(processed_news)

# Prepare training data
print("Preparing training data...")
combined = system.prepare_training_data(sentiment_features)

# Train
print("Training model...")
system.fit(combined)

# Evaluate
results = system.evaluate()

print("\n" + "="*50)
print("SENTIMENT TRADING SYSTEM RESULTS")
print("="*50)
print(f"\nClassification Metrics:")
print(f"  Train Accuracy: {results['train_accuracy']:.2%}")
print(f"  Test Accuracy:  {results['test_accuracy']:.2%}")
print(f"\nFinancial Metrics:")
print(f"  Strategy Return: {results['total_return']:.2%}")
print(f"  Buy & Hold:      {results['buy_hold_return']:.2%}")
print(f"  Outperformance:  {results['outperformance']:.2%}")
print(f"  Sharpe Ratio:    {results['sharpe_ratio']:.2f}")
print(f"\nTop Features:")
for feat, imp in sorted(results['feature_importance'].items(), key=lambda x: -x[1])[:5]:
    print(f"  {feat}: {imp:.4f}")
# Visualize

system.plot_results()

Key Takeaways

  1. Text preprocessing is critical: clean, normalize, and extract relevant entities from financial text

  2. Lexicon-based methods (VADER, custom dictionaries) are fast but may miss context

  3. Financial-specific lexicons outperform general sentiment tools for market data

  4. Sentiment features should include means, volatility, momentum, and rolling statistics

  5. Combining sentiment with price features often improves prediction accuracy

  6. News impact analysis helps identify which sentiment signals are most predictive

  7. Real-time news sources (Twitter, news APIs) provide actionable signals but require careful latency management


Next: Module 10 - Alternative Data (Web scraping, social media, multi-source features)

Module 10: Alternative Data

Part 3: Advanced Techniques

Duration Exercises Prerequisites
~2.5 hours 6 Modules 1-9

Learning Objectives

By the end of this module, you will be able to:

- Understand alternative data sources for trading
- Collect data from web and API sources
- Process and clean social media data
- Combine multiple data sources into features
- Build multi-source prediction models

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
import json
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Data collection
try:
    import requests
    HAS_REQUESTS = True
except ImportError:
    HAS_REQUESTS = False
    print("requests not installed. Install with: pip install requests")

try:
    from bs4 import BeautifulSoup
    HAS_BS4 = True
except ImportError:
    HAS_BS4 = False
    print("BeautifulSoup not installed. Install with: pip install beautifulsoup4")

# ML
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import accuracy_score, classification_report

import yfinance as yf

print("Module 10: Alternative Data")
print("=" * 40)

Section 1: Alternative Data Sources

Understanding the landscape of non-traditional data for trading.

# Alternative Data Overview

alt_data_overview = """
ALTERNATIVE DATA FOR TRADING
============================

What is Alternative Data?
-------------------------
Non-traditional data sources beyond price/volume that provide
insights into economic activity, company performance, or sentiment.

Categories:
-----------

1. SOCIAL & SENTIMENT
   - Twitter/X posts and trends
   - Reddit (r/wallstreetbets, r/investing)
   - StockTwits messages
   - News headlines and articles
   - Analyst reports

2. WEB DATA
   - Google Trends search volume
   - Website traffic (SimilarWeb)
   - Job postings (LinkedIn, Indeed)
   - Product reviews and ratings
   - Price comparison sites

3. TRANSACTION DATA
   - Credit card transactions
   - Point of sale data
   - App usage metrics
   - Email receipts

4. GEOSPATIAL DATA
   - Satellite imagery
   - GPS/location data
   - Foot traffic counts
   - Shipping/logistics tracking

5. GOVERNMENT & ECONOMIC
   - SEC filings
   - Patent applications
   - Building permits
   - Import/export data

Considerations:
---------------
- Data quality and consistency
- Latency and timeliness
- Cost of acquisition
- Legal/compliance issues
- Alpha decay (as data becomes common)
"""
print(alt_data_overview)
# Simulate alternative data sources

def generate_simulated_alt_data(n_days: int = 252, symbol: str = "AAPL") -> Dict[str, pd.DataFrame]:
    """Generate simulated alternative data for demonstration."""
    np.random.seed(42)
    
    dates = pd.date_range(end=pd.Timestamp.today(), periods=n_days, freq='B')
    
    # 1. Social Media Metrics
    social_data = pd.DataFrame({
        'date': dates,
        'twitter_mentions': np.random.poisson(500, n_days),
        'twitter_sentiment': np.random.normal(0.1, 0.3, n_days),
        'reddit_posts': np.random.poisson(50, n_days),
        'reddit_comments': np.random.poisson(200, n_days),
        'stocktwits_messages': np.random.poisson(100, n_days),
        'stocktwits_bullish_pct': np.random.beta(6, 4, n_days)  # Slightly bullish bias
    }).set_index('date')
    
    # 2. Web Traffic Data
    base_traffic = 1000000 + np.cumsum(np.random.normal(0, 50000, n_days))
    web_data = pd.DataFrame({
        'date': dates,
        'website_visits': np.maximum(base_traffic, 500000).astype(int),
        'app_downloads': np.random.poisson(10000, n_days),
        'google_trend_score': np.clip(np.random.normal(60, 15, n_days), 0, 100),
        'product_reviews': np.random.poisson(500, n_days),
        'avg_review_score': np.random.normal(4.2, 0.3, n_days).clip(1, 5)
    }).set_index('date')
    
    # 3. Job Posting Data
    base_jobs = 200 + np.cumsum(np.random.normal(0, 5, n_days))
    job_data = pd.DataFrame({
        'date': dates,
        'job_postings': np.maximum(base_jobs, 100).astype(int),
        'engineering_jobs': np.random.poisson(50, n_days),
        'sales_jobs': np.random.poisson(30, n_days),
        'avg_salary_listed': np.random.normal(120000, 20000, n_days)
    }).set_index('date')
    
    # 4. Satellite/Foot Traffic Data
    geo_data = pd.DataFrame({
        'date': dates,
        'store_foot_traffic': np.random.poisson(5000, n_days),
        'parking_lot_fill': np.random.beta(5, 3, n_days),
        'shipping_containers': np.random.poisson(1000, n_days)
    }).set_index('date')
    
    return {
        'social': social_data,
        'web': web_data,
        'jobs': job_data,
        'geo': geo_data
    }

# Generate data
alt_data = generate_simulated_alt_data()

print("Generated Alternative Data:")
for source, df in alt_data.items():
    print(f"\n{source.upper()}:")
    print(f"  Columns: {df.columns.tolist()}")
    print(f"  Shape: {df.shape}")
# Visualize alternative data

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Social media
ax1 = axes[0, 0]
ax1.plot(alt_data['social'].index, alt_data['social']['twitter_mentions'], 
         label='Twitter Mentions', alpha=0.7)
ax1.set_ylabel('Mentions')
ax1.set_title('Social Media Activity')
ax1_twin = ax1.twinx()
ax1_twin.plot(alt_data['social'].index, alt_data['social']['twitter_sentiment'], 
              'r-', label='Sentiment', alpha=0.7)
ax1_twin.set_ylabel('Sentiment', color='r')
ax1.legend(loc='upper left')
ax1.grid(True, alpha=0.3)

# Web traffic
axes[0, 1].plot(alt_data['web'].index, alt_data['web']['website_visits'], 
                label='Website Visits')
axes[0, 1].plot(alt_data['web'].index, alt_data['web']['google_trend_score'] * 20000, 
                label='Google Trends (scaled)', alpha=0.7)
axes[0, 1].set_ylabel('Visits')
axes[0, 1].set_title('Web Traffic Metrics')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Job postings
axes[1, 0].fill_between(alt_data['jobs'].index, alt_data['jobs']['job_postings'], 
                        alpha=0.5, label='Total Jobs')
axes[1, 0].plot(alt_data['jobs'].index, alt_data['jobs']['engineering_jobs'] * 4, 
                label='Engineering (x4)', alpha=0.8)
axes[1, 0].set_ylabel('Job Postings')
axes[1, 0].set_title('Job Market Indicators')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Geospatial
axes[1, 1].plot(alt_data['geo'].index, alt_data['geo']['store_foot_traffic'], 
                label='Foot Traffic')
axes[1, 1].set_ylabel('Daily Visitors')
axes[1, 1].set_title('Physical Activity Indicators')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
# Exercise 10.1: Alt Data Feature Calculator (Guided)

def calculate_alt_data_features(alt_data: Dict[str, pd.DataFrame], 
                                 lookback_days: List[int] = [5, 20]) -> pd.DataFrame:
    """
    Calculate features from alternative data sources.
    
    Returns:
        DataFrame with all calculated features
    """
    features = pd.DataFrame()
    
    # Social features
    social = alt_data['social']
    features['social_sentiment'] = social['twitter_sentiment']
    features['social_volume'] = social['twitter_mentions'] + social['reddit_posts'] * 10
    features['bullish_ratio'] = social['stocktwits_bullish_pct']
    
    # TODO: Calculate rolling features for social
    for days in lookback_days:
        features[f'sentiment_ma{days}'] = social['twitter_sentiment'].______(days).______()
        features[f'volume_ma{days}'] = features['social_volume'].______(days).______()
    
    # Web features
    web = alt_data['web']
    features['web_traffic'] = web['website_visits']
    features['google_trends'] = web['google_trend_score']
    features['review_score'] = web['avg_review_score']
    
    # TODO: Calculate traffic changes
    features['traffic_change_5d'] = web['website_visits'].______(5)
    features['traffic_change_20d'] = web['website_visits'].______(20)
    
    # Job features
    jobs = alt_data['jobs']
    features['job_postings'] = jobs['job_postings']
    features['job_growth'] = jobs['job_postings'].pct_change(20)
    features['eng_to_sales'] = jobs['engineering_jobs'] / (jobs['sales_jobs'] + 1)
    
    # Geo features
    geo = alt_data['geo']
    features['foot_traffic'] = geo['store_foot_traffic']
    features['parking_fill'] = geo['parking_lot_fill']
    
    return features.dropna()

# Test
# alt_features = calculate_alt_data_features(alt_data)
Solution 10.1
def calculate_alt_data_features(alt_data: Dict[str, pd.DataFrame], 
                                 lookback_days: List[int] = [5, 20]) -> pd.DataFrame:
    features = pd.DataFrame()

    social = alt_data['social']
    features['social_sentiment'] = social['twitter_sentiment']
    features['social_volume'] = social['twitter_mentions'] + social['reddit_posts'] * 10
    features['bullish_ratio'] = social['stocktwits_bullish_pct']

    for days in lookback_days:
        features[f'sentiment_ma{days}'] = social['twitter_sentiment'].rolling(days).mean()
        features[f'volume_ma{days}'] = features['social_volume'].rolling(days).mean()

    web = alt_data['web']
    features['web_traffic'] = web['website_visits']
    features['google_trends'] = web['google_trend_score']
    features['review_score'] = web['avg_review_score']

    features['traffic_change_5d'] = web['website_visits'].pct_change(5)
    features['traffic_change_20d'] = web['website_visits'].pct_change(20)

    jobs = alt_data['jobs']
    features['job_postings'] = jobs['job_postings']
    features['job_growth'] = jobs['job_postings'].pct_change(20)
    features['eng_to_sales'] = jobs['engineering_jobs'] / (jobs['sales_jobs'] + 1)

    geo = alt_data['geo']
    features['foot_traffic'] = geo['store_foot_traffic']
    features['parking_fill'] = geo['parking_lot_fill']

    return features.dropna()
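If the two pandas operations the guided blanks ask for are unfamiliar, here is a tiny standalone check of `rolling(...).mean()` and `pct_change(n)` (the numbers are arbitrary):

```python
import pandas as pd

# Toy series to show the two feature operations used above
s = pd.Series([100, 110, 121, 133.1])

ma2 = s.rolling(2).mean()   # 2-period moving average: NaN, 105.0, 115.5, 127.05
chg2 = s.pct_change(2)      # 2-period change: NaN, NaN, 121/100 - 1, 133.1/110 - 1

print(ma2.round(2).tolist())
print(chg2.round(2).tolist())
```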

Section 2: Web Data Collection

Techniques for collecting data from web sources.

# Web Scraping Best Practices

web_scraping_guide = """
WEB DATA COLLECTION
===================

Data Collection Methods:
------------------------
1. APIs (Preferred)
   - Official, structured access
   - Rate limits and authentication
   - Examples: Twitter API, Reddit API, Alpha Vantage

2. Web Scraping
   - Parse HTML content
   - Tools: BeautifulSoup, Scrapy, Selenium
   - Requires understanding of HTML structure

3. Data Vendors
   - Pre-processed alternative data
   - Examples: Quandl, Bloomberg, Refinitiv
   - Higher cost, higher quality

Ethical Considerations:
-----------------------
- Respect robots.txt
- Rate limit your requests
- Don't overload servers
- Respect terms of service
- Handle personal data carefully

Technical Best Practices:
-------------------------
- Use user-agent headers
- Implement exponential backoff
- Cache responses
- Handle errors gracefully
- Log all requests

Common Data Sources:
--------------------
- SEC EDGAR (company filings)
- Google Trends
- GitHub activity
- Wikipedia pageviews
- Job boards (Indeed, LinkedIn)
"""
print(web_scraping_guide)
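The exponential backoff mentioned under "Technical Best Practices" can be sketched as a small generic retry wrapper. `fetch_with_backoff` and the `flaky` callable below are illustrative names, not part of any library; the demo fakes two transient failures rather than making real requests:

```python
import time

def fetch_with_backoff(fetch, max_retries=4, base_delay=0.1):
    """Retry a flaky callable with exponential backoff (0.1s, 0.2s, 0.4s, ...)."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Demo with a fake fetch that fails twice before succeeding
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError("transient failure")
    return "payload"

result = fetch_with_backoff(flaky)
print(result, "after", calls['n'], "attempts")
```

Doubling the delay after each failure keeps retries cheap for transient hiccups while quickly backing off from a server that is genuinely struggling.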
# Web Data Collector Class

class WebDataCollector:
    """Collect data from web sources with rate limiting and caching."""
    
    def __init__(self, rate_limit: float = 1.0):
        """
        Args:
            rate_limit: Minimum seconds between requests
        """
        self.rate_limit = rate_limit
        self.last_request = datetime.min
        self.cache = {}
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Educational/Research Purpose)'
        }
    
    def _wait_for_rate_limit(self):
        """Ensure we don't exceed rate limit."""
        elapsed = (datetime.now() - self.last_request).total_seconds()
        if elapsed < self.rate_limit:
            import time
            time.sleep(self.rate_limit - elapsed)
        self.last_request = datetime.now()
    
    def fetch_url(self, url: str, use_cache: bool = True) -> Optional[str]:
        """Fetch content from URL."""
        if not HAS_REQUESTS:
            print("requests library not available")
            return None
        
        # Check cache
        if use_cache and url in self.cache:
            return self.cache[url]
        
        self._wait_for_rate_limit()
        
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            content = response.text
            
            # Cache result
            self.cache[url] = content
            return content
            
        except requests.exceptions.RequestException as e:
            print(f"Error fetching {url}: {e}")
            return None
    
    def parse_html(self, html: str, selector: str) -> List[str]:
        """Parse HTML and extract text from elements."""
        if not HAS_BS4 or not html:
            return []
        
        soup = BeautifulSoup(html, 'html.parser')
        elements = soup.select(selector)
        return [elem.get_text(strip=True) for elem in elements]
    
    def fetch_json(self, url: str) -> Optional[Dict]:
        """Fetch and parse JSON from URL."""
        content = self.fetch_url(url)
        if content:
            try:
                return json.loads(content)
            except json.JSONDecodeError:
                print(f"Invalid JSON from {url}")
        return None

# Example usage
collector = WebDataCollector(rate_limit=1.0)
print("WebDataCollector initialized")
print(f"  Rate limit: {collector.rate_limit}s")
print(f"  User-Agent: {collector.headers['User-Agent']}")
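The rate-limit logic is easiest to see in isolation. A minimal stand-in that mirrors `_wait_for_rate_limit` (the `RateLimiter` name and the 0.05s limit are illustrative; the real collector uses 1s):

```python
import time
from datetime import datetime

class RateLimiter:
    """Minimal stand-in for WebDataCollector's _wait_for_rate_limit logic."""
    def __init__(self, rate_limit=0.05):
        self.rate_limit = rate_limit
        self.last_request = datetime.min

    def wait(self):
        elapsed = (datetime.now() - self.last_request).total_seconds()
        if elapsed < self.rate_limit:
            time.sleep(self.rate_limit - elapsed)
        self.last_request = datetime.now()

limiter = RateLimiter(rate_limit=0.05)
start = time.monotonic()
for _ in range(3):
    limiter.wait()
total = time.monotonic() - start
print(f"3 calls took {total:.2f}s")  # first is free, the next two each wait ~0.05s
```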
# Simulated API data (for demonstration without actual API calls)

def simulate_google_trends_data(keyword: str, n_days: int = 90) -> pd.DataFrame:
    """Simulate Google Trends data."""
    # Stable seed: Python's built-in hash() is salted per process, so use crc32
    import zlib
    np.random.seed(zlib.crc32(keyword.encode()))
    
    dates = pd.date_range(end=pd.Timestamp.today(), periods=n_days, freq='D')
    
    # Generate trend with weekly seasonality and random noise
    trend = np.random.normal(50, 10, n_days)
    seasonality = 10 * np.sin(2 * np.pi * np.arange(n_days) / 7)
    values = np.clip(trend + seasonality, 0, 100)
    
    return pd.DataFrame({
        'date': dates,
        'keyword': keyword,
        'interest': values.astype(int)
    }).set_index('date')

def simulate_reddit_data(subreddit: str, n_days: int = 90) -> pd.DataFrame:
    """Simulate Reddit activity data."""
    # Stable seed: Python's built-in hash() is salted per process, so use crc32
    import zlib
    np.random.seed(zlib.crc32(subreddit.encode()))
    
    dates = pd.date_range(end=pd.Timestamp.today(), periods=n_days, freq='D')
    
    return pd.DataFrame({
        'date': dates,
        'subreddit': subreddit,
        'posts': np.random.poisson(50, n_days),
        'comments': np.random.poisson(500, n_days),
        'avg_score': np.random.exponential(100, n_days),
        'sentiment': np.random.normal(0.1, 0.4, n_days)
    }).set_index('date')

# Get simulated data
google_data = simulate_google_trends_data("Tesla stock")
reddit_data = simulate_reddit_data("wallstreetbets")

print("Simulated Google Trends:")
print(google_data.head())
print("\nSimulated Reddit Data:")
print(reddit_data.head())
# Exercise 10.2: Multi-Source Data Aggregator (Guided)

class MultiSourceAggregator:
    """
    Aggregate data from multiple alternative sources.
    """
    
    def __init__(self):
        self.sources = {}
        self.combined_data = None
        
    def add_source(self, name: str, data: pd.DataFrame,
                   date_col: Optional[str] = None):
        """Add a data source."""
        df = data.copy()
        
        # Ensure date index
        if date_col and date_col in df.columns:
            df = df.set_index(date_col)
        
        # Prefix columns with source name
        df.columns = [f'{name}_{col}' for col in df.columns]
        
        self.sources[name] = df
        return self
    
    def combine(self, fill_method: str = 'ffill') -> pd.DataFrame:
        """Combine all sources into single DataFrame."""
        if not self.sources:
            return pd.DataFrame()
        
        # TODO: Start with first source
        combined = list(self.sources.______())[0].copy()
        
        # TODO: Join remaining sources
        for name, df in list(self.sources.items())[1:]:
            combined = combined.______(df, how='outer')
        
        # Fill missing values
        if fill_method == 'ffill':
            combined = combined.ffill()
        elif fill_method == 'zero':
            combined = combined.fillna(0)
        
        self.combined_data = combined
        return combined
    
    def get_correlation_matrix(self) -> pd.DataFrame:
        """Calculate correlation between all features."""
        if self.combined_data is None:
            self.combine()
        return self.combined_data.corr()

# Test
# aggregator = MultiSourceAggregator()
# aggregator.add_source('google', google_data)
# aggregator.add_source('reddit', reddit_data)
Solution 10.2
class MultiSourceAggregator:
    def __init__(self):
        self.sources = {}
        self.combined_data = None

    def add_source(self, name: str, data: pd.DataFrame,
                   date_col: Optional[str] = None):
        df = data.copy()

        if date_col and date_col in df.columns:
            df = df.set_index(date_col)

        df.columns = [f'{name}_{col}' for col in df.columns]

        self.sources[name] = df
        return self

    def combine(self, fill_method: str = 'ffill') -> pd.DataFrame:
        if not self.sources:
            return pd.DataFrame()

        combined = list(self.sources.values())[0].copy()

        for name, df in list(self.sources.items())[1:]:
            combined = combined.join(df, how='outer')

        if fill_method == 'ffill':
            combined = combined.ffill()
        elif fill_method == 'zero':
            combined = combined.fillna(0)

        self.combined_data = combined
        return combined

    def get_correlation_matrix(self) -> pd.DataFrame:
        if self.combined_data is None:
            self.combine()
        return self.combined_data.corr()
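Applied to two toy frames, `combine` reduces to prefixing columns by source name and outer-joining on the date index. A standalone illustration of that pattern (not using the class):

```python
import pandas as pd

dates = pd.date_range('2024-01-01', periods=4, freq='D')
google = pd.DataFrame({'interest': [50, 55, 60, 58]}, index=dates)
reddit = pd.DataFrame({'posts': [10, 12, 9]}, index=dates[1:])  # starts a day late

# Prefix columns by source name, then outer-join and forward-fill
combined = google.add_prefix('google_').join(reddit.add_prefix('reddit_'), how='outer').ffill()
print(combined)
# reddit_posts stays NaN on day 1: there is nothing earlier to forward-fill from
```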

Section 3: Social Media Data

Processing and analyzing social media data for trading signals.

# Social Media Data Processing

class SocialMediaProcessor:
    """Process social media data for trading signals."""
    
    def __init__(self):
        self.ticker_patterns = {}
        
    def extract_tickers(self, text: str) -> List[str]:
        """Extract stock tickers from text."""
        import re
        
        # Pattern: $AAPL or $aapl
        cashtag_pattern = r'\$([A-Za-z]{1,5})'
        tickers = re.findall(cashtag_pattern, text)
        
        return [t.upper() for t in tickers]
    
    def calculate_engagement_score(self, likes: int, comments: int, 
                                    shares: int, followers: int) -> float:
        """Calculate normalized engagement score."""
        if followers == 0:
            return 0
        
        engagement = (likes + comments * 2 + shares * 3) / followers
        return min(engagement * 100, 100)  # Cap at 100
    
    def analyze_post_timing(self, timestamps: pd.Series) -> Dict:
        """Analyze timing patterns in posts."""
        timestamps = pd.to_datetime(timestamps)
        
        return {
            'posts_by_hour': timestamps.dt.hour.value_counts().to_dict(),
            'posts_by_day': timestamps.dt.dayofweek.value_counts().to_dict(),
            'peak_hour': timestamps.dt.hour.mode().iloc[0] if len(timestamps) > 0 else None,
            'weekend_ratio': (timestamps.dt.dayofweek >= 5).mean()
        }
    
    def calculate_velocity(self, counts: pd.Series, window: int = 24) -> pd.Series:
        """Calculate rate of change in social activity."""
        return counts.diff(window) / window
    
    def detect_anomalies(self, values: pd.Series, threshold: float = 2.0) -> pd.Series:
        """Detect unusual spikes in activity."""
        mean = values.rolling(20).mean()
        std = values.rolling(20).std()
        z_score = (values - mean) / std
        return z_score.abs() > threshold

# Test
processor = SocialMediaProcessor()

sample_text = "$AAPL is looking bullish! Also watching $TSLA and $MSFT for breakouts."
tickers = processor.extract_tickers(sample_text)
print(f"Extracted tickers: {tickers}")

engagement = processor.calculate_engagement_score(likes=500, comments=50, shares=25, followers=10000)
print(f"Engagement score: {engagement:.2f}")
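`calculate_velocity` and `detect_anomalies` are thin wrappers over rolling pandas operations. A standalone check on a synthetic series with one injected spike:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
activity = pd.Series(rng.poisson(50, 100).astype(float))
activity.iloc[80] = 300  # injected spike

# Velocity: rate of change over a 7-step window (mirrors calculate_velocity)
velocity = activity.diff(7) / 7

# Rolling z-score with window 20, threshold 2 (mirrors detect_anomalies)
z = (activity - activity.rolling(20).mean()) / activity.rolling(20).std()
spikes = z.abs() > 2

print(f"Spike at index 80 flagged: {bool(spikes.iloc[80])}")
```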
# Simulate social media posts

def simulate_social_posts(ticker: str, n_posts: int = 1000) -> pd.DataFrame:
    """Simulate social media posts about a ticker."""
    # Stable seed: Python's built-in hash() is salted per process, so use crc32
    import zlib
    np.random.seed(zlib.crc32(ticker.encode()))
    
    # Generate timestamps over 30 days
    base_date = pd.Timestamp.today() - timedelta(days=30)
    timestamps = pd.to_datetime(
        base_date + pd.to_timedelta(np.random.uniform(0, 30*24*60, n_posts), unit='m')
    )
    
    # Simulate post metrics
    posts = pd.DataFrame({
        'timestamp': timestamps,
        'ticker': ticker,
        'likes': np.random.exponential(50, n_posts).astype(int),
        'comments': np.random.exponential(10, n_posts).astype(int),
        'shares': np.random.exponential(5, n_posts).astype(int),
        'followers': np.random.exponential(5000, n_posts).astype(int),
        'sentiment': np.random.normal(0.1, 0.5, n_posts)
    }).sort_values('timestamp')
    
    return posts

# Generate simulated posts
social_posts = simulate_social_posts("AAPL")
print(f"Simulated {len(social_posts)} social media posts")
print(social_posts.head())
# Aggregate social posts to daily features

def aggregate_social_to_daily(posts: pd.DataFrame) -> pd.DataFrame:
    """Aggregate social media posts to daily features."""
    posts = posts.copy()
    posts['date'] = posts['timestamp'].dt.date
    
    # Aggregate
    daily = posts.groupby('date').agg({
        'likes': ['sum', 'mean'],
        'comments': ['sum', 'mean'],
        'shares': ['sum', 'mean'],
        'sentiment': ['mean', 'std'],
        'timestamp': 'count'
    })
    
    # Flatten column names
    daily.columns = ['_'.join(col).strip() for col in daily.columns.values]
    daily = daily.rename(columns={'timestamp_count': 'post_count'})
    
    # Calculate engagement
    daily['total_engagement'] = daily['likes_sum'] + daily['comments_sum'] * 2 + daily['shares_sum'] * 3
    
    # Rolling features
    daily['engagement_ma5'] = daily['total_engagement'].rolling(5).mean()
    daily['sentiment_ma5'] = daily['sentiment_mean'].rolling(5).mean()
    daily['post_velocity'] = daily['post_count'].diff(1)
    
    daily.index = pd.to_datetime(daily.index)
    return daily.dropna()

# Aggregate
daily_social = aggregate_social_to_daily(social_posts)
print("Daily Social Features:")
print(daily_social.head())
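The MultiIndex flattening step after the groupby is worth seeing on its own. A minimal standalone example of the agg-then-flatten pattern:

```python
import pandas as pd

posts = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02'],
    'likes': [5, 15, 8],
    'sentiment': [0.2, 0.4, -0.1],
})

daily = posts.groupby('date').agg({'likes': ['sum', 'mean'], 'sentiment': ['mean']})
# agg with a dict of lists yields two-level columns; join them into flat names
daily.columns = ['_'.join(col) for col in daily.columns]
print(daily)
# columns become: likes_sum, likes_mean, sentiment_mean
```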
# Exercise 10.3: Social Momentum Detector (Guided)

class SocialMomentumDetector:
    """
    Detect momentum in social media activity.
    """
    
    def __init__(self, short_window: int = 3, long_window: int = 10):
        self.short_window = short_window
        self.long_window = long_window
        
    def calculate_momentum(self, daily_data: pd.DataFrame) -> pd.DataFrame:
        """Calculate social momentum indicators."""
        df = daily_data.copy()
        
        # Volume momentum (post count)
        # TODO: Calculate short and long moving averages
        df['volume_ma_short'] = df['post_count'].______(self.short_window).______()
        df['volume_ma_long'] = df['post_count'].______(self.long_window).______()
        df['volume_momentum'] = df['volume_ma_short'] / df['volume_ma_long'] - 1
        
        # Engagement momentum
        df['eng_ma_short'] = df['total_engagement'].rolling(self.short_window).mean()
        df['eng_ma_long'] = df['total_engagement'].rolling(self.long_window).mean()
        df['engagement_momentum'] = df['eng_ma_short'] / df['eng_ma_long'] - 1
        
        # Sentiment momentum
        df['sent_ma_short'] = df['sentiment_mean'].rolling(self.short_window).mean()
        df['sent_ma_long'] = df['sentiment_mean'].rolling(self.long_window).mean()
        df['sentiment_momentum'] = df['sent_ma_short'] - df['sent_ma_long']
        
        return df
    
    def generate_signals(self, momentum_data: pd.DataFrame,
                         volume_threshold: float = 0.2,
                         sentiment_threshold: float = 0.1) -> pd.Series:
        """Generate trading signals from momentum."""
        signals = pd.Series(0, index=momentum_data.index)
        
        # Bullish: high volume momentum + positive sentiment momentum
        bullish = (
            (momentum_data['volume_momentum'] > volume_threshold) &
            (momentum_data['sentiment_momentum'] > sentiment_threshold)
        )
        signals[bullish] = 1
        
        # Bearish: high volume momentum + negative sentiment momentum
        bearish = (
            (momentum_data['volume_momentum'] > volume_threshold) &
            (momentum_data['sentiment_momentum'] < -sentiment_threshold)
        )
        signals[bearish] = -1
        
        return signals

# Test
# detector = SocialMomentumDetector()
# momentum_data = detector.calculate_momentum(daily_social)
Solution 10.3
class SocialMomentumDetector:
    def __init__(self, short_window: int = 3, long_window: int = 10):
        self.short_window = short_window
        self.long_window = long_window

    def calculate_momentum(self, daily_data: pd.DataFrame) -> pd.DataFrame:
        df = daily_data.copy()

        df['volume_ma_short'] = df['post_count'].rolling(self.short_window).mean()
        df['volume_ma_long'] = df['post_count'].rolling(self.long_window).mean()
        df['volume_momentum'] = df['volume_ma_short'] / df['volume_ma_long'] - 1

        df['eng_ma_short'] = df['total_engagement'].rolling(self.short_window).mean()
        df['eng_ma_long'] = df['total_engagement'].rolling(self.long_window).mean()
        df['engagement_momentum'] = df['eng_ma_short'] / df['eng_ma_long'] - 1

        df['sent_ma_short'] = df['sentiment_mean'].rolling(self.short_window).mean()
        df['sent_ma_long'] = df['sentiment_mean'].rolling(self.long_window).mean()
        df['sentiment_momentum'] = df['sent_ma_short'] - df['sent_ma_long']

        return df

    def generate_signals(self, momentum_data: pd.DataFrame,
                         volume_threshold: float = 0.2,
                         sentiment_threshold: float = 0.1) -> pd.Series:
        signals = pd.Series(0, index=momentum_data.index)

        bullish = (
            (momentum_data['volume_momentum'] > volume_threshold) &
            (momentum_data['sentiment_momentum'] > sentiment_threshold)
        )
        signals[bullish] = 1

        bearish = (
            (momentum_data['volume_momentum'] > volume_threshold) &
            (momentum_data['sentiment_momentum'] < -sentiment_threshold)
        )
        signals[bearish] = -1

        return signals
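The heart of `calculate_momentum` is the short-over-long moving-average ratio. On a deliberately accelerating toy series the arithmetic is easy to verify by hand (standalone check):

```python
import pandas as pd

post_count = pd.Series([10, 10, 10, 10, 10, 20, 30, 40, 50, 60], dtype=float)

ma_short = post_count.rolling(3).mean()   # last window: (40+50+60)/3 = 50
ma_long = post_count.rolling(10).mean()   # last window: 250/10 = 25
momentum = ma_short / ma_long - 1

print(momentum.iloc[-1])  # 50/25 - 1 = 1.0
```

A value of 1.0 means recent posting volume runs at double its longer-run average, which is exactly the kind of spike the detector's thresholds are meant to catch.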

Section 4: Multi-Source Prediction Model

Building prediction models that combine multiple data sources.

# Multi-Source Feature Engineering

class MultiSourceFeatureEngine:
    """Create features from multiple alternative data sources."""
    
    def __init__(self):
        self.feature_names = []
        
    def create_price_features(self, price_df: pd.DataFrame) -> pd.DataFrame:
        """Create features from price data."""
        features = pd.DataFrame(index=price_df.index)
        
        features['returns'] = price_df['Close'].pct_change()
        features['volatility'] = features['returns'].rolling(20).std()
        
        for p in [5, 10, 20]:
            features[f'momentum_{p}'] = price_df['Close'].pct_change(p)
        
        features['volume_ratio'] = price_df['Volume'] / price_df['Volume'].rolling(20).mean()
        
        return features
    
    def create_social_features(self, social_df: pd.DataFrame) -> pd.DataFrame:
        """Create features from social media data."""
        features = pd.DataFrame(index=social_df.index)
        
        features['social_volume'] = social_df.get('post_count', 0)
        features['social_sentiment'] = social_df.get('sentiment_mean', 0)
        features['social_engagement'] = social_df.get('total_engagement', 0)
        
        # Normalize
        for col in features.columns:
            mean = features[col].rolling(20).mean()
            std = features[col].rolling(20).std()
            features[f'{col}_zscore'] = (features[col] - mean) / std
        
        return features
    
    def create_web_features(self, web_df: pd.DataFrame) -> pd.DataFrame:
        """Create features from web traffic data."""
        features = pd.DataFrame(index=web_df.index)
        
        for col in ['website_visits', 'google_trend_score', 'app_downloads']:
            if col in web_df.columns:
                features[col] = web_df[col]
                features[f'{col}_change'] = web_df[col].pct_change(5)
        
        return features
    
    def combine_all(self, price_df: pd.DataFrame, 
                    social_df: pd.DataFrame = None,
                    web_df: pd.DataFrame = None,
                    alt_data: Dict = None) -> pd.DataFrame:
        """Combine all feature sources."""
        # Start with price features
        combined = self.create_price_features(price_df)
        
        # Add social features
        if social_df is not None:
            social_features = self.create_social_features(social_df)
            combined = combined.join(social_features, how='left')
        
        # Add web features
        if web_df is not None:
            web_features = self.create_web_features(web_df)
            combined = combined.join(web_features, how='left')
        
        # Add other alt data
        if alt_data is not None:
            for source_name, df in alt_data.items():
                # add_prefix returns a copy, so the caller's DataFrame is not mutated
                df = df.add_prefix(f'{source_name}_')
                combined = combined.join(df, how='left')
        
        # Fill missing
        combined = combined.ffill().fillna(0)
        
        self.feature_names = combined.columns.tolist()
        
        return combined.dropna()

# Test
feature_engine = MultiSourceFeatureEngine()
print("MultiSourceFeatureEngine initialized")
# Build multi-source model

def build_multi_source_model(symbol: str = "AAPL") -> Dict:
    """Build and evaluate model with multiple data sources."""
    
    # Get price data
    ticker = yf.Ticker(symbol)
    price_df = ticker.history(period="1y")
    price_df.index = price_df.index.tz_localize(None)
    
    # Create target
    price_df['target'] = (price_df['Close'].pct_change().shift(-1) > 0).astype(int)
    
    # Generate simulated alt data
    alt_data = generate_simulated_alt_data(len(price_df), symbol)
    
    # Align alt data with price data
    for source in alt_data:
        alt_data[source].index = price_df.index[:len(alt_data[source])]
    
    # Create features
    feature_engine = MultiSourceFeatureEngine()
    
    # Model 1: Price only
    price_features = feature_engine.create_price_features(price_df)
    
    # Model 2: Price + Social
    combined_social = feature_engine.combine_all(price_df, social_df=alt_data['social'])
    
    # Model 3: All sources
    combined_all = feature_engine.combine_all(
        price_df,
        social_df=alt_data['social'],
        web_df=alt_data['web'],
        alt_data={'jobs': alt_data['jobs'], 'geo': alt_data['geo']}
    )
    
    results = {}
    scaler = StandardScaler()
    
    for name, features in [('price_only', price_features), 
                           ('price_social', combined_social),
                           ('all_sources', combined_all)]:
        # Align with target
        target = price_df['target'].loc[features.index]
        features = features.loc[target.index]
        
        features = features.dropna()
        target = target.loc[features.index]
        
        # Split
        split_idx = int(len(features) * 0.8)
        X_train, X_test = features[:split_idx], features[split_idx:]
        y_train, y_test = target[:split_idx], target[split_idx:]
        
        # Scale and train
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
        model.fit(X_train_scaled, y_train)
        
        results[name] = {
            'accuracy': accuracy_score(y_test, model.predict(X_test_scaled)),
            'n_features': len(features.columns),
            'feature_importance': dict(zip(features.columns, model.feature_importances_))
        }
    
    return results

# Build models
model_results = build_multi_source_model()

print("Multi-Source Model Results:")
print("=" * 50)
for name, result in model_results.items():
    print(f"\n{name}:")
    print(f"  Accuracy: {result['accuracy']:.2%}")
    print(f"  Features: {result['n_features']}")
    print(f"  Top 3 Features:")
    for feat, imp in sorted(result['feature_importance'].items(), key=lambda x: -x[1])[:3]:
        print(f"    {feat}: {imp:.4f}")
# Exercise 10.4: Complete Alt Data System (Open-ended)
#
# Build an AlternativeDataSystem class that:
# - Collects data from multiple simulated sources
# - Creates features from each source
# - Combines all sources with price data
# - Trains and evaluates prediction models
# - Compares value-add of each data source
# - Generates a report on feature importance
#
# Your implementation:
Solution 10.4
class AlternativeDataSystem:
    """Complete alternative data trading system."""

    def __init__(self, symbol: str = "AAPL"):
        self.symbol = symbol
        self.price_data = None
        self.alt_data = {}
        self.features = None
        self.models = {}
        self.results = {}

    def collect_data(self, period: str = "1y"):
        """Collect price and alternative data."""
        # Price data
        ticker = yf.Ticker(self.symbol)
        self.price_data = ticker.history(period=period)
        self.price_data.index = self.price_data.index.tz_localize(None)

        # Simulated alt data
        n_days = len(self.price_data)
        self.alt_data = generate_simulated_alt_data(n_days, self.symbol)

        # Align indices
        for source in self.alt_data:
            self.alt_data[source].index = self.price_data.index[:len(self.alt_data[source])]

        return self

    def create_features(self):
        """Create features from all sources."""
        engine = MultiSourceFeatureEngine()
        self.features = engine.combine_all(
            self.price_data,
            social_df=self.alt_data['social'],
            web_df=self.alt_data['web'],
            alt_data={'jobs': self.alt_data['jobs'], 'geo': self.alt_data['geo']}
        )
        return self

    def train_models(self, test_frac: float = 0.2):
        """Train models with different feature sets."""
        target = (self.price_data['Close'].pct_change().shift(-1) > 0).astype(int)
        target = target.loc[self.features.index]

        # Define feature sets
        price_cols = [c for c in self.features.columns if not any(
            s in c for s in ['social', 'web', 'jobs', 'geo'])]
        social_cols = [c for c in self.features.columns if 'social' in c]
        all_cols = self.features.columns.tolist()

        feature_sets = {
            'price_only': price_cols,
            'price_social': price_cols + social_cols,
            'all_sources': all_cols
        }

        split_idx = int(len(self.features) * (1 - test_frac))
        scaler = StandardScaler()

        for name, cols in feature_sets.items():
            X = self.features[cols]
            y = target

            X_train, X_test = X[:split_idx], X[split_idx:]
            y_train, y_test = y[:split_idx], y[split_idx:]

            X_train_scaled = scaler.fit_transform(X_train)
            X_test_scaled = scaler.transform(X_test)

            model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
            model.fit(X_train_scaled, y_train)

            self.models[name] = model
            self.results[name] = {
                'accuracy': accuracy_score(y_test, model.predict(X_test_scaled)),
                'features': cols,
                'importance': dict(zip(cols, model.feature_importances_))
            }

        return self

    def compare_sources(self) -> pd.DataFrame:
        """Compare value of each data source."""
        rows = []
        base_acc = self.results['price_only']['accuracy']

        for name, result in self.results.items():
            rows.append({
                'Model': name,
                'Accuracy': result['accuracy'],
                'Improvement': result['accuracy'] - base_acc,
                'N_Features': len(result['features'])
            })

        return pd.DataFrame(rows)

    def get_top_features(self, n: int = 10) -> pd.DataFrame:
        """Get top features across all models."""
        all_importance = self.results['all_sources']['importance']
        return pd.DataFrame([
            {'feature': k, 'importance': v}
            for k, v in sorted(all_importance.items(), key=lambda x: -x[1])[:n]
        ])

    def generate_report(self) -> str:
        """Generate text report."""
        report = f"""Alternative Data System Report
================================
Symbol: {self.symbol}
Data Points: {len(self.features)}
Total Features: {len(self.features.columns)}

Model Comparison:
"""
        for name, result in self.results.items():
            report += f"  {name}: {result['accuracy']:.2%} ({len(result['features'])} features)\n"

        report += "\nTop Features:\n"
        for _, row in self.get_top_features(5).iterrows():
            report += f"  {row['feature']}: {row['importance']:.4f}\n"

        return report
# Exercise 10.5: Data Source Evaluator (Open-ended)
#
# Build a DataSourceEvaluator class that:
# - Tests predictive value of individual data sources
# - Uses ablation studies (removing one source at a time)
# - Calculates information ratio for each source
# - Ranks sources by value-add
# - Visualizes source contributions
#
# Your implementation:
Solution 10.5
class DataSourceEvaluator:
    """Evaluate predictive value of individual data sources."""

    def __init__(self, features: pd.DataFrame, target: pd.Series):
        self.features = features
        self.target = target
        self.source_results = {}

    def identify_sources(self) -> Dict[str, List[str]]:
        """Identify which features belong to which source."""
        sources = {'price': [], 'social': [], 'web': [], 'jobs': [], 'geo': []}

        for col in self.features.columns:
            assigned = False
            for source in ['social', 'web', 'jobs', 'geo']:
                if source in col.lower():
                    sources[source].append(col)
                    assigned = True
                    break
            if not assigned:
                sources['price'].append(col)

        return sources

    def evaluate_single_source(self, source_name: str, 
                                source_cols: List[str]) -> Dict:
        """Evaluate a single data source."""
        if not source_cols:
            return {'accuracy': 0, 'features': 0}

        X = self.features[source_cols]
        y = self.target

        split_idx = int(len(X) * 0.8)
        X_train, X_test = X[:split_idx], X[split_idx:]
        y_train, y_test = y[:split_idx], y[split_idx:]

        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)
        model.fit(X_train_scaled, y_train)

        return {
            'accuracy': accuracy_score(y_test, model.predict(X_test_scaled)),
            'features': len(source_cols)
        }

    def ablation_study(self) -> pd.DataFrame:
        """Remove each source and measure impact."""
        sources = self.identify_sources()
        all_cols = self.features.columns.tolist()

        # Full model baseline
        full_result = self.evaluate_single_source('all', all_cols)

        results = [{'source': 'all', 'accuracy': full_result['accuracy'], 
                   'drop': 0, 'features': len(all_cols)}]

        for source, cols in sources.items():
            if cols:
                remaining_cols = [c for c in all_cols if c not in cols]
                if remaining_cols:
                    result = self.evaluate_single_source(f'without_{source}', remaining_cols)
                    results.append({
                        'source': f'without_{source}',
                        'accuracy': result['accuracy'],
                        'drop': full_result['accuracy'] - result['accuracy'],
                        'features': len(remaining_cols)
                    })

        return pd.DataFrame(results).sort_values('drop', ascending=False)

    def rank_sources(self) -> pd.DataFrame:
        """Rank data sources by value."""
        sources = self.identify_sources()

        results = []
        for source, cols in sources.items():
            result = self.evaluate_single_source(source, cols)
            results.append({
                'source': source,
                'accuracy': result['accuracy'],
                'features': result['features'],
                'acc_per_feature': result['accuracy'] / max(result['features'], 1)
            })

        return pd.DataFrame(results).sort_values('accuracy', ascending=False)

    def plot_contributions(self):
        """Visualize source contributions."""
        ranking = self.rank_sources()
        ablation = self.ablation_study()

        fig, axes = plt.subplots(1, 2, figsize=(14, 5))

        axes[0].bar(ranking['source'], ranking['accuracy'])
        axes[0].set_title('Accuracy by Single Source')
        axes[0].set_ylabel('Accuracy')
        axes[0].axhline(y=0.5, color='red', linestyle='--')
        axes[0].tick_params(axis='x', rotation=45)

        ablation_plot = ablation[ablation['source'] != 'all']
        colors = ['red' if d > 0 else 'green' for d in ablation_plot['drop']]
        axes[1].bar(ablation_plot['source'], ablation_plot['drop'], color=colors)
        axes[1].set_title('Accuracy Drop When Removing Source')
        axes[1].set_ylabel('Accuracy Drop')
        axes[1].tick_params(axis='x', rotation=45)

        plt.tight_layout()
        plt.show()
# Exercise 10.6: Production Alt Data Pipeline (Open-ended)
#
# Build a ProductionAltDataPipeline class that:
# - Simulates real-time data collection
# - Handles missing data and outliers
# - Updates features incrementally
# - Generates trading signals with confidence
# - Tracks data quality metrics
# - Provides alerting for data anomalies
#
# Your implementation:
Solution 10.6
class ProductionAltDataPipeline:
    """Production-ready alternative data pipeline."""

    def __init__(self):
        self.data_buffer = {}
        self.features = pd.DataFrame()
        self.model = None
        self.scaler = StandardScaler()
        self.quality_metrics = {}
        self.alerts = []

    def ingest_data(self, source: str, data: Dict, timestamp: datetime):
        """Ingest new data point."""
        if source not in self.data_buffer:
            self.data_buffer[source] = []

        data['timestamp'] = timestamp
        self.data_buffer[source].append(data)

        # Check for anomalies
        self._check_anomalies(source, data)

    def _check_anomalies(self, source: str, data: Dict):
        """Check for data anomalies."""
        if len(self.data_buffer[source]) < 10:
            return

        # Get recent values
        recent = pd.DataFrame(self.data_buffer[source][-20:])

        for col in recent.select_dtypes(include=[np.number]).columns:
            mean = recent[col].mean()
            std = recent[col].std()
            if std > 0:
                z_score = abs(data.get(col, mean) - mean) / std
                if z_score > 3:
                    self.alerts.append({
                        'timestamp': data['timestamp'],
                        'source': source,
                        'field': col,
                        'z_score': z_score,
                        'message': f'Anomaly detected in {source}.{col}'
                    })

    def update_features(self):
        """Update features from buffer."""
        if not self.data_buffer:
            return

        # Convert buffers to DataFrames
        dfs = {}
        for source, buffer in self.data_buffer.items():
            df = pd.DataFrame(buffer)
            df = df.set_index('timestamp')
            df.columns = [f'{source}_{c}' for c in df.columns]
            dfs[source] = df

        # Combine
        if dfs:
            combined = list(dfs.values())[0]
            for df in list(dfs.values())[1:]:
                combined = combined.join(df, how='outer')
            self.features = combined.ffill()  # fillna(method=...) is deprecated in pandas 2.x

    def handle_missing_data(self, method: str = 'ffill'):
        """Handle missing data."""
        before = self.features.isna().sum().sum()

        if method == 'ffill':
            self.features = self.features.ffill()  # fillna(method=...) is deprecated in pandas 2.x
        elif method == 'interpolate':
            self.features = self.features.interpolate()
        elif method == 'zero':
            self.features = self.features.fillna(0)

        after = self.features.isna().sum().sum()
        self.quality_metrics['missing_filled'] = before - after

    def generate_signal(self) -> Dict:
        """Generate trading signal from latest features."""
        if self.model is None or self.features.empty:
            return {'signal': 0, 'signal_name': 'HOLD', 'confidence': 0}

        latest = self.features.iloc[-1:]
        latest_scaled = self.scaler.transform(latest)

        prediction = self.model.predict(latest_scaled)[0]
        proba = self.model.predict_proba(latest_scaled)[0]
        confidence = max(proba)

        return {
            'signal': prediction,
            'signal_name': 'BUY' if prediction == 1 else 'SELL',
            'confidence': confidence,
            'timestamp': self.features.index[-1]
        }

    def get_quality_report(self) -> Dict:
        """Generate data quality report."""
        return {
            'sources': list(self.data_buffer.keys()),
            'total_records': sum(len(b) for b in self.data_buffer.values()),
            'feature_count': len(self.features.columns),
            'missing_values': self.features.isna().sum().to_dict(),
            'alerts': len(self.alerts),
            'recent_alerts': self.alerts[-5:] if self.alerts else []
        }

    def train_model(self, target: pd.Series):
        """Train prediction model."""
        # reindex (not .loc) so target labels missing from features don't raise a KeyError
        aligned = self.features.reindex(target.index).dropna()
        target = target.loc[aligned.index]

        X_scaled = self.scaler.fit_transform(aligned)

        self.model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
        self.model.fit(X_scaled, target)

        return self
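The `_check_anomalies` method above reduces to a rolling z-score test: compare each new value against the mean and standard deviation of a trailing window. A minimal standalone sketch of the same idea (synthetic values and a hypothetical threshold of 3, matching the pipeline's default):

```python
import numpy as np

def zscore_alerts(values, window=20, threshold=3.0):
    """Flag indices whose z-score vs. the trailing window exceeds threshold."""
    alerts = []
    for i in range(window, len(values)):
        recent = values[i - window:i]
        mean, std = np.mean(recent), np.std(recent, ddof=1)
        if std > 0 and abs(values[i] - mean) / std > threshold:
            alerts.append(i)
    return alerts

# Stable series with one injected spike at index 50
rng = np.random.default_rng(0)
series = rng.normal(100, 1, 80)
series[50] = 120  # anomaly
print(zscore_alerts(series))
```

Note the window deliberately excludes the current point, so a single spike cannot inflate the statistics used to judge it.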

Module Project: Complete Alternative Data Trading System

Build a comprehensive system combining multiple alternative data sources.

class AltDataTradingSystem:
    """
    Complete alternative data trading system.
    
    Combines social, web, job, and geospatial data
    with price data for trading signals.
    """
    
    def __init__(self, symbol: str = "AAPL"):
        self.symbol = symbol
        self.price_data = None
        self.alt_data = {}
        self.features = None
        self.model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
        self.scaler = StandardScaler()
        
    def load_data(self, period: str = "1y"):
        """Load price and alternative data."""
        # Price data
        ticker = yf.Ticker(self.symbol)
        self.price_data = ticker.history(period=period)
        self.price_data.index = self.price_data.index.tz_localize(None)
        
        # Target
        self.price_data['target'] = (self.price_data['Close'].pct_change().shift(-1) > 0).astype(int)
        
        # Simulated alt data
        self.alt_data = generate_simulated_alt_data(len(self.price_data), self.symbol)
        
        # Align indices
        for source in self.alt_data:
            self.alt_data[source].index = self.price_data.index[:len(self.alt_data[source])]
        
        print(f"Loaded {len(self.price_data)} days of price data")
        print(f"Alt data sources: {list(self.alt_data.keys())}")
        
        return self
    
    def create_features(self):
        """Create comprehensive feature set."""
        features = pd.DataFrame(index=self.price_data.index)
        
        # Price features
        features['returns'] = self.price_data['Close'].pct_change()
        features['volatility'] = features['returns'].rolling(20).std()
        for p in [5, 10, 20]:
            features[f'momentum_{p}'] = self.price_data['Close'].pct_change(p)
        features['volume_ratio'] = self.price_data['Volume'] / self.price_data['Volume'].rolling(20).mean()
        
        # Social features
        social = self.alt_data['social']
        features['social_sentiment'] = social['twitter_sentiment']
        features['social_volume'] = social['twitter_mentions']
        features['bullish_ratio'] = social['stocktwits_bullish_pct']
        features['social_sentiment_ma5'] = features['social_sentiment'].rolling(5).mean()
        features['social_momentum'] = features['social_sentiment'] - features['social_sentiment_ma5']
        
        # Web features
        web = self.alt_data['web']
        features['web_traffic'] = web['website_visits']
        features['traffic_change'] = web['website_visits'].pct_change(5)
        features['google_trends'] = web['google_trend_score']
        
        # Job features
        jobs = self.alt_data['jobs']
        features['job_growth'] = jobs['job_postings'].pct_change(20)
        features['eng_ratio'] = jobs['engineering_jobs'] / (jobs['sales_jobs'] + 1)
        
        # Geo features
        geo = self.alt_data['geo']
        features['foot_traffic'] = geo['store_foot_traffic']
        features['parking_fill'] = geo['parking_lot_fill']
        
        self.features = features.dropna()
        self.feature_names = self.features.columns.tolist()
        
        print(f"Created {len(self.feature_names)} features")
        
        return self
    
    def fit(self, test_frac: float = 0.2):
        """Train the model."""
        target = self.price_data['target'].loc[self.features.index]
        
        split_idx = int(len(self.features) * (1 - test_frac))
        self.X_train = self.features[:split_idx]
        self.X_test = self.features[split_idx:]
        self.y_train = target[:split_idx]
        self.y_test = target[split_idx:]
        # Align returns to the (NaN-dropped) feature index before slicing;
        # positional indices in price_data and features diverge after dropna
        self.test_returns = self.price_data['Close'].pct_change().reindex(self.features.index)[split_idx:]
        
        X_train_scaled = self.scaler.fit_transform(self.X_train)
        self.model.fit(X_train_scaled, self.y_train)
        
        return self
    
    def evaluate(self) -> Dict:
        """Evaluate model performance."""
        X_test_scaled = self.scaler.transform(self.X_test)
        y_pred = self.model.predict(X_test_scaled)
        
        # Classification metrics
        accuracy = accuracy_score(self.y_test, y_pred)
        
        # Financial metrics
        pred_series = pd.Series(y_pred, index=self.y_test.index)
        test_returns = self.test_returns.loc[pred_series.index]
        strategy_returns = pred_series.shift(1) * test_returns
        strategy_returns = strategy_returns.dropna()
        
        total_return = (1 + strategy_returns).cumprod().iloc[-1] - 1
        bh_return = (1 + test_returns.loc[strategy_returns.index]).cumprod().iloc[-1] - 1
        sharpe = np.sqrt(252) * strategy_returns.mean() / strategy_returns.std()
        
        return {
            'accuracy': accuracy,
            'total_return': total_return,
            'buy_hold_return': bh_return,
            'outperformance': total_return - bh_return,
            'sharpe_ratio': sharpe,
            'feature_importance': dict(zip(self.feature_names, self.model.feature_importances_))
        }
    
    def plot_results(self):
        """Visualize results."""
        X_test_scaled = self.scaler.transform(self.X_test)
        y_pred = self.model.predict(X_test_scaled)
        
        pred_series = pd.Series(y_pred, index=self.y_test.index)
        test_returns = self.test_returns.loc[pred_series.index]
        strategy_returns = pred_series.shift(1) * test_returns
        
        cum_strategy = (1 + strategy_returns.fillna(0)).cumprod()
        cum_bh = (1 + test_returns.fillna(0)).cumprod()
        
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        # Cumulative returns
        axes[0, 0].plot(cum_strategy.index, cum_strategy, label='Strategy', linewidth=2)
        axes[0, 0].plot(cum_bh.index, cum_bh, label='Buy & Hold', linewidth=2, alpha=0.7)
        axes[0, 0].set_ylabel('Cumulative Return')
        axes[0, 0].set_title('Alt Data Strategy Performance')
        axes[0, 0].legend()
        axes[0, 0].grid(True, alpha=0.3)
        
        # Feature importance
        importance = pd.Series(
            self.model.feature_importances_,
            index=self.feature_names
        ).sort_values(ascending=True).tail(10)
        
        axes[0, 1].barh(importance.index, importance.values, color='steelblue')
        axes[0, 1].set_xlabel('Importance')
        axes[0, 1].set_title('Top 10 Feature Importance')
        axes[0, 1].grid(True, alpha=0.3)
        
        # Social sentiment vs returns
        axes[1, 0].scatter(self.features['social_sentiment'].loc[test_returns.index],
                          test_returns, alpha=0.5)
        axes[1, 0].set_xlabel('Social Sentiment')
        axes[1, 0].set_ylabel('Next Day Return')
        axes[1, 0].set_title('Sentiment vs Returns')
        axes[1, 0].axhline(y=0, color='red', linestyle='--')
        axes[1, 0].axvline(x=0, color='red', linestyle='--')
        axes[1, 0].grid(True, alpha=0.3)
        
        # Data source contribution
        sources = {'price': 0, 'social': 0, 'web': 0, 'job': 0, 'geo': 0}
        for feat, imp in zip(self.feature_names, self.model.feature_importances_):
            if 'social' in feat or 'bullish' in feat:
                sources['social'] += imp
            elif 'web' in feat or 'traffic' in feat or 'google' in feat:
                sources['web'] += imp
            elif 'job' in feat or 'eng' in feat:
                sources['job'] += imp
            elif 'foot' in feat or 'parking' in feat:
                sources['geo'] += imp
            else:
                sources['price'] += imp
        
        axes[1, 1].pie(sources.values(), labels=sources.keys(), autopct='%1.1f%%')
        axes[1, 1].set_title('Feature Importance by Source')
        
        plt.tight_layout()
        plt.show()
# Run the complete system
system = AltDataTradingSystem("AAPL")
system.load_data()
system.create_features()
system.fit()

# Evaluate
results = system.evaluate()

print("\n" + "="*50)
print("ALT DATA TRADING SYSTEM RESULTS")
print("="*50)
print(f"\nAccuracy: {results['accuracy']:.2%}")
print(f"\nStrategy Return: {results['total_return']:.2%}")
print(f"Buy & Hold Return: {results['buy_hold_return']:.2%}")
print(f"Outperformance: {results['outperformance']:.2%}")
print(f"Sharpe Ratio: {results['sharpe_ratio']:.2f}")
print(f"\nTop 5 Features:")
for feat, imp in sorted(results['feature_importance'].items(), key=lambda x: -x[1])[:5]:
    print(f"  {feat}: {imp:.4f}")
# Visualize
system.plot_results()

Key Takeaways

  1. Alternative data provides unique signals beyond price and volume

  2. Data collection requires attention to rate limits, caching, and error handling

  3. Social media data can capture market sentiment in real-time

  4. Multi-source models often outperform single-source models

  5. Data quality monitoring is essential for production systems

  6. Ablation studies help identify which sources provide the most value

  7. Alpha decay means alternative data edges diminish as more traders use them
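The ablation idea in takeaway 6 is simple to implement: retrain the model with each source's feature group removed and measure the accuracy drop. An illustrative sketch on synthetic data (the source groups and feature names are hypothetical; in the system above you would group the real feature columns instead):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n = 600

# Synthetic features grouped by (hypothetical) data source
groups = {
    'price':  [f'price_{i}' for i in range(3)],
    'social': [f'social_{i}' for i in range(2)],
    'web':    [f'web_{i}' for i in range(2)],
}
X = pd.DataFrame(rng.normal(size=(n, 7)),
                 columns=[c for cols in groups.values() for c in cols])
# Target depends mostly on the social features by construction
y = (X['social_0'] + 0.5 * X['social_1'] + rng.normal(0, 0.5, n) > 0).astype(int)

split = int(n * 0.8)  # time-ordered split, no shuffling

def score(cols):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[cols][:split], y[:split])
    return accuracy_score(y[split:], model.predict(X[cols][split:]))

baseline = score(X.columns.tolist())
for source, cols in groups.items():
    kept = [c for c in X.columns if c not in cols]
    print(f"without {source}: accuracy drop = {baseline - score(kept):+.3f}")
```

Here removing the social group causes the largest drop, correctly identifying it as the dominant source; on real data, run the ablation over walk-forward folds rather than a single split.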


Next: Module 11 - Deep Learning for Finance (Neural networks, LSTM, transformers)

Module 11: Deep Learning for Finance

Overview

Deep learning has revolutionized many fields, and finance is no exception. This module covers neural network architectures particularly suited for financial applications, from basic feedforward networks to advanced sequence models like LSTMs and Transformers.

Learning Objectives

By the end of this module, you will be able to:

  • Build and train neural networks for financial prediction
  • Implement LSTM networks for time series forecasting
  • Apply attention mechanisms and Transformers to market data
  • Design appropriate architectures for different financial tasks

Prerequisites

  • Module 6: Other Classification Models (Neural Network basics)
  • Module 8: Regression Models
  • Understanding of backpropagation and gradient descent

Estimated Time: 4 hours


Section 1: Neural Network Fundamentals for Finance

Neural networks can capture complex non-linear relationships in financial data that traditional models miss.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Deep learning imports
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, Model, Sequential
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error, mean_absolute_error

np.random.seed(42)
tf.random.set_seed(42)

print(f"TensorFlow version: {tf.__version__}")
print("Deep learning libraries loaded successfully")
# Generate synthetic financial data
def generate_financial_data(n_samples=2000):
    """Generate synthetic stock data with realistic patterns."""
    np.random.seed(42)
    
    dates = pd.date_range(start='2018-01-01', periods=n_samples, freq='D')
    
    # Generate price series with trend and volatility clustering
    returns = np.random.normal(0.0003, 0.015, n_samples)
    
    # Add volatility clustering (GARCH-like effect)
    volatility = np.ones(n_samples) * 0.015
    for i in range(1, n_samples):
        volatility[i] = 0.9 * volatility[i-1] + 0.1 * abs(returns[i-1]) * 2
        returns[i] = np.random.normal(0.0003, volatility[i])
    
    # Generate prices
    prices = 100 * np.exp(np.cumsum(returns))
    
    # Create OHLCV data
    high = prices * (1 + np.abs(np.random.normal(0, 0.01, n_samples)))
    low = prices * (1 - np.abs(np.random.normal(0, 0.01, n_samples)))
    volume = np.random.lognormal(15, 0.5, n_samples)
    
    df = pd.DataFrame({
        'date': dates,
        'open': np.roll(prices, 1),
        'high': high,
        'low': low,
        'close': prices,
        'volume': volume
    })
    df.loc[0, 'open'] = df.loc[0, 'close']
    df.set_index('date', inplace=True)
    
    return df

# Generate data
df = generate_financial_data(2000)
print(f"Dataset shape: {df.shape}")
df.tail()
# Feature engineering for neural networks
def create_nn_features(df, lookback_periods=[5, 10, 20, 50]):
    """Create features suitable for neural network input."""
    data = df.copy()
    
    # Returns at different horizons
    data['return_1d'] = data['close'].pct_change()
    data['return_5d'] = data['close'].pct_change(5)
    data['return_20d'] = data['close'].pct_change(20)
    
    # Volatility features
    for period in lookback_periods:
        data[f'volatility_{period}d'] = data['return_1d'].rolling(period).std()
        data[f'sma_{period}'] = data['close'].rolling(period).mean()
        data[f'price_to_sma_{period}'] = data['close'] / data[f'sma_{period}']
    
    # RSI
    delta = data['close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
    rs = gain / loss
    data['rsi'] = 100 - (100 / (1 + rs))
    
    # MACD
    exp12 = data['close'].ewm(span=12).mean()
    exp26 = data['close'].ewm(span=26).mean()
    data['macd'] = exp12 - exp26
    data['macd_signal'] = data['macd'].ewm(span=9).mean()
    data['macd_hist'] = data['macd'] - data['macd_signal']
    
    # Volume features
    data['volume_sma_20'] = data['volume'].rolling(20).mean()
    data['volume_ratio'] = data['volume'] / data['volume_sma_20']
    
    # Price range
    data['daily_range'] = (data['high'] - data['low']) / data['close']
    data['avg_range_20'] = data['daily_range'].rolling(20).mean()
    
    # Target: Next day return direction
    data['target'] = (data['close'].shift(-1) > data['close']).astype(int)
    data['target_return'] = data['close'].pct_change().shift(-1)
    
    return data.dropna()

# Create features
df_features = create_nn_features(df)
print(f"Features created: {df_features.shape[1]} columns")
print(f"Samples after processing: {len(df_features)}")
# Building a Feedforward Neural Network for classification
class FinancialNeuralNetwork:
    """Neural network for financial prediction tasks."""
    
    def __init__(self, input_dim, hidden_layers=[64, 32, 16], 
                 dropout_rate=0.3, learning_rate=0.001):
        self.input_dim = input_dim
        self.hidden_layers = hidden_layers
        self.dropout_rate = dropout_rate
        self.learning_rate = learning_rate
        self.model = None
        self.scaler = StandardScaler()
        self.history = None
        
    def build_classifier(self):
        """Build classification neural network."""
        model = Sequential()
        
        # Input layer
        model.add(layers.Dense(self.hidden_layers[0], 
                               input_dim=self.input_dim,
                               activation='relu',
                               kernel_regularizer=keras.regularizers.l2(0.01)))
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(self.dropout_rate))
        
        # Hidden layers
        for units in self.hidden_layers[1:]:
            model.add(layers.Dense(units, activation='relu',
                                   kernel_regularizer=keras.regularizers.l2(0.01)))
            model.add(layers.BatchNormalization())
            model.add(layers.Dropout(self.dropout_rate))
        
        # Output layer
        model.add(layers.Dense(1, activation='sigmoid'))
        
        model.compile(
            optimizer=Adam(learning_rate=self.learning_rate),
            loss='binary_crossentropy',
            metrics=['accuracy']
        )
        
        self.model = model
        return model
    
    def build_regressor(self):
        """Build regression neural network."""
        model = Sequential()
        
        # Input layer
        model.add(layers.Dense(self.hidden_layers[0], 
                               input_dim=self.input_dim,
                               activation='relu',
                               kernel_regularizer=keras.regularizers.l2(0.01)))
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(self.dropout_rate))
        
        # Hidden layers
        for units in self.hidden_layers[1:]:
            model.add(layers.Dense(units, activation='relu',
                                   kernel_regularizer=keras.regularizers.l2(0.01)))
            model.add(layers.BatchNormalization())
            model.add(layers.Dropout(self.dropout_rate))
        
        # Output layer (linear for regression)
        model.add(layers.Dense(1, activation='linear'))
        
        model.compile(
            optimizer=Adam(learning_rate=self.learning_rate),
            loss='mse',
            metrics=['mae']
        )
        
        self.model = model
        return model
    
    def prepare_data(self, X, fit_scaler=True):
        """Scale features for neural network."""
        if fit_scaler:
            return self.scaler.fit_transform(X)
        return self.scaler.transform(X)
    
    def train(self, X_train, y_train, X_val=None, y_val=None,
              epochs=100, batch_size=32, verbose=1):
        """Train the neural network with early stopping."""
        callbacks = [
            EarlyStopping(monitor='val_loss', patience=10, 
                          restore_best_weights=True),
            ReduceLROnPlateau(monitor='val_loss', factor=0.5, 
                              patience=5, min_lr=1e-6)
        ]
        
        validation_data = None
        if X_val is not None and y_val is not None:
            validation_data = (X_val, y_val)
        
        self.history = self.model.fit(
            X_train, y_train,
            validation_data=validation_data,
            epochs=epochs,
            batch_size=batch_size,
            callbacks=callbacks,
            verbose=verbose
        )
        
        return self.history
    
    def predict(self, X):
        """Make predictions."""
        return self.model.predict(X, verbose=0)
    
    def predict_classes(self, X, threshold=0.5):
        """Predict class labels."""
        probs = self.predict(X)
        return (probs >= threshold).astype(int).flatten()

print("FinancialNeuralNetwork class defined")
# Prepare data for classification
feature_cols = ['return_1d', 'return_5d', 'return_20d',
                'volatility_5d', 'volatility_10d', 'volatility_20d',
                'price_to_sma_5', 'price_to_sma_10', 'price_to_sma_20', 'price_to_sma_50',
                'rsi', 'macd_hist', 'volume_ratio', 'daily_range']

X = df_features[feature_cols].values
y = df_features['target'].values

# Time-based split (no shuffling for time series)
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

# Further split training for validation
val_idx = int(len(X_train) * 0.8)
X_train_nn, X_val = X_train[:val_idx], X_train[val_idx:]
y_train_nn, y_val = y_train[:val_idx], y_train[val_idx:]

# Build and train classifier
nn_classifier = FinancialNeuralNetwork(
    input_dim=len(feature_cols),
    hidden_layers=[64, 32, 16],
    dropout_rate=0.3
)
nn_classifier.build_classifier()

# Scale data
X_train_scaled = nn_classifier.prepare_data(X_train_nn, fit_scaler=True)
X_val_scaled = nn_classifier.prepare_data(X_val, fit_scaler=False)
X_test_scaled = nn_classifier.prepare_data(X_test, fit_scaler=False)

# Train model
print("Training neural network classifier...")
history = nn_classifier.train(
    X_train_scaled, y_train_nn,
    X_val_scaled, y_val,
    epochs=50,
    batch_size=32,
    verbose=0
)

# Evaluate
train_pred = nn_classifier.predict_classes(X_train_scaled)
test_pred = nn_classifier.predict_classes(X_test_scaled)

print(f"\nTraining accuracy: {accuracy_score(y_train_nn, train_pred):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, test_pred):.4f}")
# Visualize training history
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Loss
axes[0].plot(history.history['loss'], label='Training')
axes[0].plot(history.history['val_loss'], label='Validation')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Model Loss')
axes[0].legend()

# Accuracy
axes[1].plot(history.history['accuracy'], label='Training')
axes[1].plot(history.history['val_accuracy'], label='Validation')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Model Accuracy')
axes[1].legend()

plt.tight_layout()
plt.show()
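With test accuracy near 50%, acting on every prediction is rarely wise. One common refinement (a sketch, with hypothetical thresholds) is to map the sigmoid output to positions with a no-trade band, staying flat when the model is uncertain:

```python
import numpy as np

def probs_to_positions(probs, upper=0.55, lower=0.45):
    """Map predicted up-probabilities to positions: long above `upper`,
    short below `lower`, flat in the no-trade band between them."""
    probs = np.asarray(probs)
    positions = np.zeros(len(probs), dtype=int)
    positions[probs >= upper] = 1
    positions[probs <= lower] = -1
    return positions

print(probs_to_positions([0.70, 0.52, 0.48, 0.30]))  # [ 1  0  0 -1]
```

The band widths are tuning parameters: wider bands trade less often but only on higher-confidence signals, which also reduces transaction costs.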

Section 2: LSTM Networks for Time Series

Long Short-Term Memory (LSTM) networks are designed to capture long-term dependencies in sequential data, making them ideal for financial time series.
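Before the full class, it helps to see the shape transformation LSTM input requires: a 2-D (time, features) array becomes a 3-D (samples, window, features) tensor of overlapping windows. A minimal NumPy sketch:

```python
import numpy as np

# 100 time steps, 3 features; a 20-step window leaves 80 (X, y) pairs
data = np.arange(100 * 3, dtype=float).reshape(100, 3)
window = 20

X = np.stack([data[i - window:i] for i in range(window, len(data))])
y = data[window:, 0]  # e.g. predict feature 0 at the next step

print(X.shape, y.shape)  # (80, 20, 3) (80,)
```

Each sample's label sits one step past the end of its window, so no window ever contains its own target — the same property the `create_sequences` pattern below relies on.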

# LSTM for Financial Time Series
class FinancialLSTM:
    """LSTM network for financial time series prediction."""
    
    def __init__(self, sequence_length=20, n_features=1,
                 lstm_units=[64, 32], dropout_rate=0.2):
        self.sequence_length = sequence_length
        self.n_features = n_features
        self.lstm_units = lstm_units
        self.dropout_rate = dropout_rate
        self.model = None
        self.scaler = MinMaxScaler()
        
    def create_sequences(self, data, target_col_idx=-1):
        """Create sequences for LSTM input."""
        X, y = [], []
        
        for i in range(self.sequence_length, len(data)):
            X.append(data[i-self.sequence_length:i])
            y.append(data[i, target_col_idx])
        
        return np.array(X), np.array(y)
    
    def build_model(self, output_type='regression'):
        """Build LSTM model."""
        model = Sequential()
        
        # First LSTM layer
        model.add(layers.LSTM(
            self.lstm_units[0],
            return_sequences=len(self.lstm_units) > 1,
            input_shape=(self.sequence_length, self.n_features)
        ))
        model.add(layers.Dropout(self.dropout_rate))
        
        # Additional LSTM layers
        for i, units in enumerate(self.lstm_units[1:]):
            return_seq = i < len(self.lstm_units) - 2
            model.add(layers.LSTM(units, return_sequences=return_seq))
            model.add(layers.Dropout(self.dropout_rate))
        
        # Dense layers
        model.add(layers.Dense(16, activation='relu'))
        
        # Output layer
        if output_type == 'regression':
            model.add(layers.Dense(1, activation='linear'))
            model.compile(optimizer=Adam(learning_rate=0.001),
                          loss='mse', metrics=['mae'])
        else:
            model.add(layers.Dense(1, activation='sigmoid'))
            model.compile(optimizer=Adam(learning_rate=0.001),
                          loss='binary_crossentropy', metrics=['accuracy'])
        
        self.model = model
        return model
    
    def prepare_data(self, df, feature_cols, target_col):
        """Prepare data for LSTM training."""
        # Get features and target
        features = df[feature_cols].values
        
        # Scale features
        scaled_features = self.scaler.fit_transform(features)
        
        # Create sequences
        X, y = self.create_sequences(scaled_features, 
                                      target_col_idx=feature_cols.index(target_col))
        
        return X, y
    
    def train(self, X_train, y_train, X_val=None, y_val=None,
              epochs=50, batch_size=32, verbose=1):
        """Train LSTM model."""
        callbacks = [
            EarlyStopping(monitor='val_loss', patience=10,
                          restore_best_weights=True),
            ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                              patience=5, min_lr=1e-6)
        ]
        
        validation_data = None
        if X_val is not None:
            validation_data = (X_val, y_val)
        
        history = self.model.fit(
            X_train, y_train,
            validation_data=validation_data,
            epochs=epochs,
            batch_size=batch_size,
            callbacks=callbacks,
            verbose=verbose
        )
        
        return history
    
    def predict(self, X):
        """Make predictions."""
        return self.model.predict(X, verbose=0)

print("FinancialLSTM class defined")
# Prepare data for LSTM
lstm_features = ['return_1d', 'volatility_10d', 'rsi', 'macd_hist', 'volume_ratio']

# Add return as target (shifted for prediction)
df_lstm = df_features[lstm_features].copy()
df_lstm['target_return'] = df_features['return_1d'].shift(-1)
df_lstm = df_lstm.dropna()

# Scale all features
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df_lstm.values)

# Create sequences
sequence_length = 20
X_seq, y_seq = [], []

for i in range(sequence_length, len(scaled_data)):
    X_seq.append(scaled_data[i-sequence_length:i, :-1])  # All features except target
    y_seq.append(scaled_data[i, -1])  # Target return

X_seq = np.array(X_seq)
y_seq = np.array(y_seq)

print(f"Sequence shape: {X_seq.shape}")
print(f"Target shape: {y_seq.shape}")
# Train LSTM model
# Time-based split
split_idx = int(len(X_seq) * 0.8)
X_train_lstm = X_seq[:split_idx]
X_test_lstm = X_seq[split_idx:]
y_train_lstm = y_seq[:split_idx]
y_test_lstm = y_seq[split_idx:]

# Validation split
val_idx = int(len(X_train_lstm) * 0.8)
X_train_l, X_val_l = X_train_lstm[:val_idx], X_train_lstm[val_idx:]
y_train_l, y_val_l = y_train_lstm[:val_idx], y_train_lstm[val_idx:]

# Build LSTM
lstm_model = Sequential([
    layers.LSTM(64, return_sequences=True, 
                input_shape=(sequence_length, len(lstm_features))),
    layers.Dropout(0.2),
    layers.LSTM(32, return_sequences=False),
    layers.Dropout(0.2),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='linear')
])

lstm_model.compile(optimizer=Adam(learning_rate=0.001),
                   loss='mse', metrics=['mae'])

print("Training LSTM model...")
lstm_history = lstm_model.fit(
    X_train_l, y_train_l,
    validation_data=(X_val_l, y_val_l),
    epochs=50,
    batch_size=32,
    callbacks=[EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)],
    verbose=0
)

# Evaluate
train_pred_lstm = lstm_model.predict(X_train_l, verbose=0)
test_pred_lstm = lstm_model.predict(X_test_lstm, verbose=0)

print(f"\nTraining MSE: {mean_squared_error(y_train_l, train_pred_lstm):.6f}")
print(f"Test MSE: {mean_squared_error(y_test_lstm, test_pred_lstm):.6f}")
print(f"Test MAE: {mean_absolute_error(y_test_lstm, test_pred_lstm):.6f}")
# Visualize LSTM predictions
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Training history
axes[0].plot(lstm_history.history['loss'], label='Training Loss')
axes[0].plot(lstm_history.history['val_loss'], label='Validation Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss (MSE)')
axes[0].set_title('LSTM Training History')
axes[0].legend()

# Predictions vs Actual
test_range = range(len(test_pred_lstm))
axes[1].plot(test_range, y_test_lstm, label='Actual', alpha=0.7)
axes[1].plot(test_range, test_pred_lstm, label='Predicted', alpha=0.7)
axes[1].set_xlabel('Time Step')
axes[1].set_ylabel('Scaled Return')
axes[1].set_title('LSTM Predictions vs Actual (Test Set)')
axes[1].legend()

plt.tight_layout()
plt.show()
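A common gotcha: the LSTM above predicts in MinMax-scaled space, so reporting actual returns requires inverting the scaling for the target column only (not the whole feature matrix). A standalone sketch with a synthetic two-column fit, using the scaler's `data_min_`/`data_max_` attributes:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit a scaler on a 2-column array where the last column is the target
rng = np.random.default_rng(1)
raw = np.column_stack([rng.normal(0, 1, 100), rng.normal(0.001, 0.02, 100)])
scaler = MinMaxScaler().fit(raw)

scaled_preds = np.array([0.2, 0.5, 0.8])  # model outputs in scaled space

# Invert only the target column: x = x_scaled * (max - min) + min
t_min, t_max = scaler.data_min_[-1], scaler.data_max_[-1]
returns = scaled_preds * (t_max - t_min) + t_min
print(returns)
```

Calling `scaler.inverse_transform` directly would require padding predictions back to the full feature width; the column-wise formula avoids that.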

Section 3: Attention Mechanisms and Transformers

Transformers use attention mechanisms to relate all time steps simultaneously rather than processing them sequentially; on some financial prediction tasks they match or outperform LSTMs, though results vary by dataset and horizon.
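Underneath every attention layer (including the multi-head wrapper used later in this section) is scaled dot-product attention: softmax(QKᵀ/√d_k)V. A NumPy sketch of the single-head, self-attention case:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — the core of every attention layer."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d = 5, 4
x = rng.normal(size=(seq_len, d))  # self-attention: Q = K = V = x
out, w = scaled_dot_product_attention(x, x, x)
print(out.shape)       # (5, 4)
print(w.sum(axis=-1))  # each row of weights sums to 1
```

Each output row is a weighted average of all value rows, with weights learned from query-key similarity — which is why attention can link a time step directly to any earlier one, regardless of distance.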

# Custom Attention Layer
class AttentionLayer(layers.Layer):
    """Simple attention mechanism for time series."""
    
    def __init__(self, **kwargs):
        super(AttentionLayer, self).__init__(**kwargs)
        
    def build(self, input_shape):
        self.W = self.add_weight(
            name='attention_weight',
            shape=(input_shape[-1], 1),
            initializer='glorot_uniform',
            trainable=True
        )
        self.b = self.add_weight(
            name='attention_bias',
            shape=(input_shape[1], 1),
            initializer='zeros',
            trainable=True
        )
        super(AttentionLayer, self).build(input_shape)
        
    def call(self, x):
        # Compute attention scores
        e = tf.nn.tanh(tf.tensordot(x, self.W, axes=1) + self.b)
        a = tf.nn.softmax(e, axis=1)
        
        # Apply attention weights
        output = tf.reduce_sum(x * a, axis=1)
        return output
    
    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])

print("AttentionLayer defined")
# Transformer Block for Financial Data
class TransformerBlock(layers.Layer):
    """Transformer block with multi-head attention."""
    
    def __init__(self, embed_dim, num_heads, ff_dim, dropout_rate=0.1):
        super(TransformerBlock, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.ff_dim = ff_dim
        
        self.att = layers.MultiHeadAttention(
            num_heads=num_heads, 
            key_dim=embed_dim
        )
        self.ffn = Sequential([
            layers.Dense(ff_dim, activation='relu'),
            layers.Dense(embed_dim)
        ])
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)
        
    def call(self, inputs, training=False):
        # Multi-head self-attention
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        
        # Feed-forward network
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

print("TransformerBlock defined")
# Positional Encoding for Transformers
class PositionalEncoding(layers.Layer):
    """Add positional information to embeddings."""
    
    def __init__(self, sequence_length, embed_dim):
        super(PositionalEncoding, self).__init__()
        self.sequence_length = sequence_length
        self.embed_dim = embed_dim
        
    def build(self, input_shape):
        # Create positional encoding matrix
        position = np.arange(self.sequence_length)[:, np.newaxis]
        div_term = np.exp(np.arange(0, self.embed_dim, 2) * 
                          -(np.log(10000.0) / self.embed_dim))
        
        pe = np.zeros((self.sequence_length, self.embed_dim))
        pe[:, 0::2] = np.sin(position * div_term)
        if self.embed_dim > 1:
            pe[:, 1::2] = np.cos(position * div_term[:self.embed_dim//2])
        
        self.pe = tf.constant(pe, dtype=tf.float32)
        super(PositionalEncoding, self).build(input_shape)
        
    def call(self, x):
        return x + self.pe

print("PositionalEncoding defined")
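The encoding above is the standard sinusoidal scheme from "Attention Is All You Need": even columns carry sines, odd columns cosines, at geometrically spaced frequencies. The same formula can be verified in plain NumPy (an even embedding dimension is assumed here to keep the slicing simple):

```python
import numpy as np

seq_len, d = 20, 8
position = np.arange(seq_len)[:, np.newaxis]
div_term = np.exp(np.arange(0, d, 2) * -(np.log(10000.0) / d))

pe = np.zeros((seq_len, d))
pe[:, 0::2] = np.sin(position * div_term)  # even columns: sin
pe[:, 1::2] = np.cos(position * div_term)  # odd columns: cos

# Row 0 encodes position zero: sin(0)=0 in even columns, cos(0)=1 in odd ones
print(pe[0])  # [0. 1. 0. 1. 0. 1. 0. 1.]
```

Because every entry is a sine or cosine, the encoding is bounded in [-1, 1], so adding it to a Dense projection does not blow up the input scale.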
# Build Financial Transformer Model
def build_financial_transformer(sequence_length, n_features, 
                                 embed_dim=32, num_heads=4, 
                                 ff_dim=64, num_blocks=2,
                                 output_type='regression'):
    """Build a Transformer model for financial prediction."""
    
    inputs = layers.Input(shape=(sequence_length, n_features))
    
    # Project input to embedding dimension
    x = layers.Dense(embed_dim)(inputs)
    
    # Add positional encoding
    x = PositionalEncoding(sequence_length, embed_dim)(x)
    
    # Transformer blocks
    for _ in range(num_blocks):
        x = TransformerBlock(embed_dim, num_heads, ff_dim, dropout_rate=0.1)(x)
    
    # Global average pooling
    x = layers.GlobalAveragePooling1D()(x)
    
    # Dense layers
    x = layers.Dense(32, activation='relu')(x)
    x = layers.Dropout(0.2)(x)
    
    # Output layer
    if output_type == 'regression':
        outputs = layers.Dense(1, activation='linear')(x)
        model = Model(inputs, outputs)
        model.compile(optimizer=Adam(learning_rate=0.001),
                      loss='mse', metrics=['mae'])
    else:
        outputs = layers.Dense(1, activation='sigmoid')(x)
        model = Model(inputs, outputs)
        model.compile(optimizer=Adam(learning_rate=0.001),
                      loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

# Build and train transformer
transformer_model = build_financial_transformer(
    sequence_length=20,
    n_features=len(lstm_features),
    embed_dim=32,
    num_heads=4,
    ff_dim=64,
    num_blocks=2
)

print("Financial Transformer built")
transformer_model.summary()
# Train transformer model
print("Training Transformer model...")
transformer_history = transformer_model.fit(
    X_train_l, y_train_l,
    validation_data=(X_val_l, y_val_l),
    epochs=50,
    batch_size=32,
    callbacks=[EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)],
    verbose=0
)

# Evaluate
transformer_pred = transformer_model.predict(X_test_lstm, verbose=0)

print(f"\nTransformer Test MSE: {mean_squared_error(y_test_lstm, transformer_pred):.6f}")
print(f"Transformer Test MAE: {mean_absolute_error(y_test_lstm, transformer_pred):.6f}")
print(f"\nLSTM Test MSE: {mean_squared_error(y_test_lstm, test_pred_lstm):.6f}")
print(f"LSTM Test MAE: {mean_absolute_error(y_test_lstm, test_pred_lstm):.6f}")

Section 4: Deep Learning Architecture Design

Designing the right architecture is crucial for financial applications. This section covers best practices and common patterns.

# Multi-Input Deep Learning Model
class MultiInputFinanceModel:
    """Deep learning model with multiple input branches."""
    
    def __init__(self, sequence_length=20, n_price_features=5,
                 n_fundamental_features=10):
        self.sequence_length = sequence_length
        self.n_price_features = n_price_features
        self.n_fundamental_features = n_fundamental_features
        self.model = None
        
    def build_model(self):
        """Build multi-input model with LSTM and dense branches."""
        
        # Price time series input (LSTM branch)
        price_input = layers.Input(
            shape=(self.sequence_length, self.n_price_features),
            name='price_input'
        )
        
        lstm_out = layers.LSTM(64, return_sequences=True)(price_input)
        lstm_out = layers.Dropout(0.2)(lstm_out)
        lstm_out = layers.LSTM(32)(lstm_out)
        lstm_out = layers.Dropout(0.2)(lstm_out)
        lstm_out = layers.Dense(16, activation='relu')(lstm_out)
        
        # Fundamental features input (Dense branch)
        fundamental_input = layers.Input(
            shape=(self.n_fundamental_features,),
            name='fundamental_input'
        )
        
        dense_out = layers.Dense(32, activation='relu')(fundamental_input)
        dense_out = layers.BatchNormalization()(dense_out)
        dense_out = layers.Dropout(0.3)(dense_out)
        dense_out = layers.Dense(16, activation='relu')(dense_out)
        
        # Merge branches
        merged = layers.Concatenate()([lstm_out, dense_out])
        
        # Final layers
        x = layers.Dense(32, activation='relu')(merged)
        x = layers.Dropout(0.2)(x)
        x = layers.Dense(16, activation='relu')(x)
        
        # Output
        output = layers.Dense(1, activation='sigmoid', name='output')(x)
        
        self.model = Model(
            inputs=[price_input, fundamental_input],
            outputs=output
        )
        
        self.model.compile(
            optimizer=Adam(learning_rate=0.001),
            loss='binary_crossentropy',
            metrics=['accuracy']
        )
        
        return self.model

# Create multi-input model
multi_model = MultiInputFinanceModel(
    sequence_length=20,
    n_price_features=5,
    n_fundamental_features=10
)
multi_model.build_model()

print("Multi-Input Model Architecture:")
multi_model.model.summary()
# Residual Network for Finance
def residual_block(x, units, dropout_rate=0.2):
    """Create a residual block with skip connection."""
    # Main path
    shortcut = x
    
    x = layers.Dense(units, activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(dropout_rate)(x)
    x = layers.Dense(units, activation='relu')(x)
    x = layers.BatchNormalization()(x)
    
    # Skip connection
    if shortcut.shape[-1] != units:
        shortcut = layers.Dense(units)(shortcut)
    
    x = layers.Add()([x, shortcut])
    x = layers.Activation('relu')(x)
    x = layers.Dropout(dropout_rate)(x)
    
    return x

def build_resnet_finance(input_dim, hidden_units=[64, 32, 16]):
    """Build a ResNet-style model for financial data."""
    inputs = layers.Input(shape=(input_dim,))
    
    x = layers.Dense(hidden_units[0], activation='relu')(inputs)
    x = layers.BatchNormalization()(x)
    
    for units in hidden_units:
        x = residual_block(x, units)
    
    x = layers.Dense(16, activation='relu')(x)
    outputs = layers.Dense(1, activation='sigmoid')(x)
    
    model = Model(inputs, outputs)
    model.compile(
        optimizer=Adam(learning_rate=0.001),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    
    return model

# Build ResNet model
resnet_model = build_resnet_finance(len(feature_cols), hidden_units=[64, 32, 32, 16])
print("ResNet-style Financial Model:")
resnet_model.summary()
# Compare architectures
print("Training ResNet model for comparison...")
resnet_history = resnet_model.fit(
    X_train_scaled, y_train_nn,
    validation_data=(X_val_scaled, y_val),
    epochs=50,
    batch_size=32,
    callbacks=[EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)],
    verbose=0
)

# Evaluate
resnet_pred = (resnet_model.predict(X_test_scaled, verbose=0) >= 0.5).astype(int)

print("\n" + "="*50)
print("Architecture Comparison (Test Set):")
print("="*50)
print(f"Simple NN Accuracy: {accuracy_score(y_test, test_pred):.4f}")
print(f"ResNet Accuracy: {accuracy_score(y_test, resnet_pred):.4f}")

Section 5: Regularization and Optimization Techniques

Financial data has a low signal-to-noise ratio, so models overfit easily. Proper regularization is essential.

# Custom Financial Loss Functions
def directional_loss(y_true, y_pred):
    """Loss that penalizes wrong direction in addition to magnitude.

    Note: the direction term below is built from tf.sign and a boolean
    cast, both of which have zero gradient. It raises the reported loss
    when signs disagree but does not itself drive learning; use a smooth
    surrogate (e.g. based on tanh of the predictions) if gradients
    through the direction term are needed.
    """
    direction_true = tf.sign(y_true)
    direction_pred = tf.sign(y_pred)
    
    # MSE component (carries the gradient)
    mse_loss = tf.reduce_mean(tf.square(y_true - y_pred))
    
    # Direction penalty: fraction of sign mismatches (non-differentiable)
    direction_penalty = tf.reduce_mean(
        tf.cast(direction_true != direction_pred, tf.float32)
    )
    
    return mse_loss + 0.5 * direction_penalty


def sharpe_loss(y_true, y_pred):
    """Loss based on Sharpe ratio approximation."""
    # Predicted returns (position * actual return)
    pred_returns = y_pred * y_true
    
    mean_return = tf.reduce_mean(pred_returns)
    std_return = tf.math.reduce_std(pred_returns) + 1e-8
    
    # Negative Sharpe (to minimize)
    sharpe = mean_return / std_return
    
    return -sharpe


print("Custom loss functions defined")
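A NumPy replica makes the effect of the direction penalty concrete: two predictions with identical squared error pay different totals when one gets the sign wrong (values chosen for illustration):

```python
import numpy as np

def directional_loss_np(y_true, y_pred, penalty=0.5):
    """NumPy version of directional_loss for inspection."""
    mse = np.mean((y_true - y_pred) ** 2)
    wrong_dir = np.mean(np.sign(y_true) != np.sign(y_pred))
    return mse + penalty * wrong_dir

y_true = np.array([0.005, -0.02])
right = np.array([0.015, -0.01])   # both signs correct, |error| = 0.01 each
wrong = np.array([-0.005, -0.01])  # same errors, but first sign flipped

print(round(directional_loss_np(y_true, right), 6))  # 0.0001
print(round(directional_loss_np(y_true, wrong), 6))  # 0.2501
```

Both predictions have MSE 0.0001; the sign flip on one of two samples adds 0.5 × 0.5 = 0.25 to the loss, which is why the penalty weight needs tuning against the typical MSE scale of your returns.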
# Advanced Regularization Techniques
def build_regularized_model(input_dim, l1_reg=0.001, l2_reg=0.01):
    """Build a heavily regularized model for noisy financial data."""
    
    model = Sequential([
        # Input layer with L1/L2 regularization
        layers.Dense(
            64, 
            input_dim=input_dim,
            activation='relu',
            kernel_regularizer=keras.regularizers.l1_l2(l1=l1_reg, l2=l2_reg),
            activity_regularizer=keras.regularizers.l2(l2_reg)
        ),
        layers.BatchNormalization(),
        layers.Dropout(0.4),
        
        # Hidden layers
        layers.Dense(
            32, 
            activation='relu',
            kernel_regularizer=keras.regularizers.l2(l2_reg)
        ),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        
        layers.Dense(
            16, 
            activation='relu',
            kernel_regularizer=keras.regularizers.l2(l2_reg)
        ),
        layers.Dropout(0.2),
        
        # Output
        layers.Dense(1, activation='sigmoid')
    ])
    
    model.compile(
        optimizer=Adam(learning_rate=0.001),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    
    return model

# Build and train regularized model
reg_model = build_regularized_model(len(feature_cols))

print("Training regularized model...")
reg_history = reg_model.fit(
    X_train_scaled, y_train_nn,
    validation_data=(X_val_scaled, y_val),
    epochs=50,
    batch_size=32,
    callbacks=[EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)],
    verbose=0
)

# Compare overfitting
print("\nRegularization Effect:")
print(f"Train-Val accuracy gap (Original): {history.history['accuracy'][-1] - history.history['val_accuracy'][-1]:.4f}")
print(f"Train-Val accuracy gap (Regularized): {reg_history.history['accuracy'][-1] - reg_history.history['val_accuracy'][-1]:.4f}")
# Learning Rate Scheduling
def get_lr_schedule(initial_lr=0.001, decay_steps=1000, decay_rate=0.9):
    """Create exponential decay learning rate schedule."""
    return keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=initial_lr,
        decay_steps=decay_steps,
        decay_rate=decay_rate,
        staircase=True
    )

def get_warmup_schedule(initial_lr=0.0001, target_lr=0.001, warmup_steps=100):
    """Create learning rate schedule with warmup."""
    class WarmupSchedule(keras.optimizers.schedules.LearningRateSchedule):
        def __init__(self, initial_lr, target_lr, warmup_steps):
            super().__init__()
            self.initial_lr = initial_lr
            self.target_lr = target_lr
            self.warmup_steps = warmup_steps
            
        def __call__(self, step):
            step = tf.cast(step, tf.float32)
            warmup_factor = tf.minimum(step / self.warmup_steps, 1.0)
            return self.initial_lr + warmup_factor * (self.target_lr - self.initial_lr)
    
    return WarmupSchedule(initial_lr, target_lr, warmup_steps)

print("Learning rate schedules defined")
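The warmup arithmetic can be verified without TensorFlow: the rate ramps linearly from `initial_lr` to `target_lr` over `warmup_steps` and then holds. This is the same formula as `WarmupSchedule.__call__`, written as a plain function:

```python
def warmup_lr(step, initial_lr=0.0001, target_lr=0.001, warmup_steps=100):
    """Linear warmup: ramp to target_lr, then stay flat."""
    warmup_factor = min(step / warmup_steps, 1.0)
    return initial_lr + warmup_factor * (target_lr - initial_lr)

print(warmup_lr(0))    # 0.0001 (start of warmup)
print(warmup_lr(50))   # 0.00055 (halfway up the ramp)
print(warmup_lr(500))  # 0.001 (held at the target rate)
```

To use a schedule, pass it in place of a fixed rate, e.g. `Adam(learning_rate=get_lr_schedule())`; Keras then evaluates it at each optimizer step.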

Section 6: Module Project - Deep Learning Trading System

Build a complete deep learning trading system that combines multiple architectures.

# Complete Deep Learning Trading System
class DeepLearningTradingSystem:
    """Production-ready deep learning trading system."""
    
    def __init__(self, sequence_length=20):
        self.sequence_length = sequence_length
        self.feature_scaler = StandardScaler()
        self.models = {}
        self.histories = {}
        
    def create_features(self, df):
        """Create comprehensive feature set."""
        data = df.copy()
        
        # Returns
        for period in [1, 5, 10, 20]:
            data[f'return_{period}d'] = data['close'].pct_change(period)
        
        # Volatility
        for period in [5, 10, 20]:
            data[f'volatility_{period}d'] = data['return_1d'].rolling(period).std()
        
        # Technical indicators
        for period in [5, 10, 20, 50]:
            sma = data['close'].rolling(period).mean()
            data[f'price_to_sma_{period}'] = data['close'] / sma
        
        # RSI
        delta = data['close'].diff()
        gain = (delta.where(delta > 0, 0)).rolling(14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
        data['rsi'] = 100 - (100 / (1 + gain / (loss + 1e-10)))
        
        # MACD
        exp12 = data['close'].ewm(span=12).mean()
        exp26 = data['close'].ewm(span=26).mean()
        data['macd'] = exp12 - exp26
        data['macd_signal'] = data['macd'].ewm(span=9).mean()
        
        # Volume
        data['volume_ratio'] = data['volume'] / data['volume'].rolling(20).mean()
        
        # Target
        data['target'] = (data['close'].shift(-1) > data['close']).astype(int)
        
        return data.dropna()
    
    def build_ensemble(self, n_features):
        """Build ensemble of different architectures."""
        
        # 1. Feedforward Network
        ff_model = Sequential([
            layers.Dense(64, input_dim=n_features, activation='relu',
                         kernel_regularizer=keras.regularizers.l2(0.01)),
            layers.BatchNormalization(),
            layers.Dropout(0.3),
            layers.Dense(32, activation='relu'),
            layers.Dropout(0.2),
            layers.Dense(16, activation='relu'),
            layers.Dense(1, activation='sigmoid')
        ])
        ff_model.compile(optimizer=Adam(0.001), 
                         loss='binary_crossentropy', metrics=['accuracy'])
        self.models['feedforward'] = ff_model
        
        # 2. ResNet-style
        self.models['resnet'] = build_resnet_finance(n_features)
        
        print(f"Built ensemble with {len(self.models)} models")
        
    def build_lstm_model(self, n_features):
        """Build LSTM model for sequence data."""
        lstm_model = Sequential([
            layers.LSTM(64, return_sequences=True,
                        input_shape=(self.sequence_length, n_features)),
            layers.Dropout(0.2),
            layers.LSTM(32),
            layers.Dropout(0.2),
            layers.Dense(16, activation='relu'),
            layers.Dense(1, activation='sigmoid')
        ])
        lstm_model.compile(optimizer=Adam(0.001),
                           loss='binary_crossentropy', metrics=['accuracy'])
        self.models['lstm'] = lstm_model
        
    def train_models(self, X_train, y_train, X_val, y_val, epochs=50):
        """Train all models in the ensemble."""
        callbacks = [
            EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
        ]
        
        for name, model in self.models.items():
            print(f"\nTraining {name}...")
            history = model.fit(
                X_train, y_train,
                validation_data=(X_val, y_val),
                epochs=epochs,
                batch_size=32,
                callbacks=callbacks,
                verbose=0
            )
            self.histories[name] = history
            print(f"  Val accuracy: {history.history['val_accuracy'][-1]:.4f}")
    
    def predict_ensemble(self, X, weights=None):
        """Make ensemble predictions."""
        if weights is None:
            weights = {name: 1/len(self.models) for name in self.models}
        
        predictions = np.zeros((len(X), 1))
        
        for name, model in self.models.items():
            pred = model.predict(X, verbose=0)
            predictions += weights[name] * pred
        
        return predictions
    
    def generate_signals(self, X, threshold=0.5):
        """Generate trading signals from ensemble."""
        probs = self.predict_ensemble(X)
        signals = np.where(probs >= threshold, 1, -1)
        return signals.flatten(), probs.flatten()
    
    def backtest(self, signals, returns):
        """Simple backtest of signals."""
        strategy_returns = signals * returns
        
        cumulative = (1 + strategy_returns).cumprod()
        
        # Metrics
        total_return = cumulative.iloc[-1] - 1
        sharpe = np.sqrt(252) * strategy_returns.mean() / (strategy_returns.std() + 1e-8)
        max_dd = (cumulative / cumulative.cummax() - 1).min()
        win_rate = (strategy_returns > 0).mean()
        
        return {
            'total_return': total_return,
            'sharpe_ratio': sharpe,
            'max_drawdown': max_dd,
            'win_rate': win_rate,
            'cumulative': cumulative
        }

print("DeepLearningTradingSystem class defined")
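The backtest arithmetic can be sanity-checked in isolation: strategy returns are just signal × return, so a constant long signal must reproduce buy-and-hold exactly. A standalone version of the same metric formulas used in `backtest`:

```python
import numpy as np
import pandas as pd

def backtest_metrics(signals, returns):
    """Same metric formulas as DeepLearningTradingSystem.backtest."""
    strat = signals * returns
    cumulative = (1 + strat).cumprod()
    return {
        'total_return': cumulative.iloc[-1] - 1,
        'sharpe_ratio': np.sqrt(252) * strat.mean() / (strat.std() + 1e-8),
        'max_drawdown': (cumulative / cumulative.cummax() - 1).min(),
        'win_rate': (strat > 0).mean(),
    }

rets = pd.Series([0.01, -0.02, 0.015, 0.005])
always_long = pd.Series([1, 1, 1, 1])
res = backtest_metrics(always_long, rets)

# A constant +1 signal matches buy-and-hold exactly
print(round(res['total_return'], 6) == round((1 + rets).prod() - 1, 6))  # True
```

Checks like this are cheap insurance before trusting backtest numbers from a larger system.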
# Run the complete system
system = DeepLearningTradingSystem(sequence_length=20)

# Create features
df_system = system.create_features(df)

# Select features
system_features = ['return_1d', 'return_5d', 'return_10d', 'return_20d',
                   'volatility_5d', 'volatility_10d', 'volatility_20d',
                   'price_to_sma_5', 'price_to_sma_10', 'price_to_sma_20',
                   'rsi', 'macd', 'volume_ratio']

X_sys = df_system[system_features].values
y_sys = df_system['target'].values
returns = df_system['return_1d'].values

# Time-based splits
split_idx = int(len(X_sys) * 0.8)
X_train_sys, X_test_sys = X_sys[:split_idx], X_sys[split_idx:]
y_train_sys, y_test_sys = y_sys[:split_idx], y_sys[split_idx:]
returns_test = returns[split_idx:]

val_idx = int(len(X_train_sys) * 0.8)
X_train_s, X_val_s = X_train_sys[:val_idx], X_train_sys[val_idx:]
y_train_s, y_val_s = y_train_sys[:val_idx], y_train_sys[val_idx:]

# Scale features
X_train_scaled_s = system.feature_scaler.fit_transform(X_train_s)
X_val_scaled_s = system.feature_scaler.transform(X_val_s)
X_test_scaled_s = system.feature_scaler.transform(X_test_sys)

# Build and train ensemble
system.build_ensemble(len(system_features))
system.train_models(X_train_scaled_s, y_train_s, X_val_scaled_s, y_val_s, epochs=30)
# Generate signals and backtest
signals, probs = system.generate_signals(X_test_scaled_s)

# Convert to pandas for backtest
returns_series = pd.Series(returns_test, 
                           index=df_system.index[split_idx:])
signals_series = pd.Series(signals, 
                           index=df_system.index[split_idx:])

# Run backtest
results = system.backtest(signals_series, returns_series)

print("\n" + "="*50)
print("Deep Learning Trading System Results")
print("="*50)
print(f"Total Return: {results['total_return']:.2%}")
print(f"Sharpe Ratio: {results['sharpe_ratio']:.2f}")
print(f"Max Drawdown: {results['max_drawdown']:.2%}")
print(f"Win Rate: {results['win_rate']:.2%}")
# Visualize results
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Cumulative returns
buy_hold = (1 + returns_series).cumprod()
axes[0, 0].plot(results['cumulative'].index, results['cumulative'].values, 
                label='Strategy', linewidth=2)
axes[0, 0].plot(buy_hold.index, buy_hold.values, 
                label='Buy & Hold', linewidth=2, alpha=0.7)
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Cumulative Return')
axes[0, 0].set_title('Strategy vs Buy & Hold')
axes[0, 0].legend()

# Prediction probabilities distribution
axes[0, 1].hist(probs, bins=50, edgecolor='black', alpha=0.7)
axes[0, 1].axvline(x=0.5, color='red', linestyle='--', label='Threshold')
axes[0, 1].set_xlabel('Prediction Probability')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Ensemble Prediction Distribution')
axes[0, 1].legend()

# Training history comparison
for name, history in system.histories.items():
    axes[1, 0].plot(history.history['val_accuracy'], label=name)
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('Validation Accuracy')
axes[1, 0].set_title('Model Training Comparison')
axes[1, 0].legend()

# Signal distribution over time
signal_ma = pd.Series(signals).rolling(20).mean()
axes[1, 1].plot(signal_ma.values)
axes[1, 1].axhline(y=0, color='red', linestyle='--')
axes[1, 1].set_xlabel('Time')
axes[1, 1].set_ylabel('Signal (20-day MA)')
axes[1, 1].set_title('Trading Signal Trend')

plt.tight_layout()
plt.show()

Exercises

Complete the following exercises to practice deep learning for finance.

Exercise 11.1: Build Custom Neural Network (Guided)

Complete the neural network architecture with proper layers.

Solution 11.1
def build_custom_classifier(input_dim, hidden_units=[128, 64, 32]):
    model = Sequential()

    # Add first Dense layer with input_dim
    model.add(layers.Dense(hidden_units[0], input_dim=input_dim, activation='relu'))
    model.add(layers.BatchNormalization())
    model.add(layers.Dropout(0.3))

    # Add remaining hidden layers
    for units in hidden_units[1:]:
        model.add(layers.Dense(units, activation='relu'))
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(0.3))

    # Add output layer
    model.add(layers.Dense(1, activation='sigmoid'))

    # Compile model
    model.compile(optimizer=Adam(0.001), loss='binary_crossentropy', metrics=['accuracy'])

    return model

Exercise 11.2: Implement LSTM Sequence Model (Guided)

Build an LSTM model for sequence prediction.

Solution 11.2
def build_lstm_classifier(sequence_length, n_features, lstm_units=[64, 32]):
    model = Sequential()

    # Add first LSTM layer (return sequences for stacking)
    model.add(layers.LSTM(
        lstm_units[0],
        return_sequences=True,
        input_shape=(sequence_length, n_features)
    ))
    model.add(layers.Dropout(0.2))

    # Add second LSTM layer (no return sequences)
    model.add(layers.LSTM(lstm_units[1], return_sequences=False))
    model.add(layers.Dropout(0.2))

    # Add Dense layers and output
    model.add(layers.Dense(16, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))

    model.compile(optimizer=Adam(0.001), loss='binary_crossentropy', metrics=['accuracy'])

    return model

Exercise 11.3: Create Training Pipeline (Guided)

Implement a training pipeline with proper callbacks.

Solution 11.3
def train_with_callbacks(model, X_train, y_train, X_val, y_val,
                         epochs=100, batch_size=32):
    # Create callbacks list
    callbacks = [
        EarlyStopping(
            monitor='val_loss',
            patience=10,
            restore_best_weights=True
        ),
        ReduceLROnPlateau(
            monitor='val_loss',
            factor=0.5,
            patience=5,
            min_lr=1e-6
        )
    ]

    # Train model
    history = model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=epochs,
        batch_size=batch_size,
        callbacks=callbacks,
        verbose=1
    )

    return history

Exercise 11.4: Build a Bidirectional LSTM (Open-ended)

Create a Bidirectional LSTM model that can learn patterns from both past and future context in the sequence.

Solution 11.4
def build_bidirectional_lstm(sequence_length, n_features, lstm_units=[64, 32]):
    model = Sequential([
        layers.Bidirectional(
            layers.LSTM(lstm_units[0], return_sequences=True),
            input_shape=(sequence_length, n_features)
        ),
        layers.Dropout(0.2),
        layers.Bidirectional(
            layers.LSTM(lstm_units[1], return_sequences=False)
        ),
        layers.Dropout(0.2),
        layers.Dense(32, activation='relu'),
        layers.BatchNormalization(),
        layers.Dropout(0.2),
        layers.Dense(16, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ])

    model.compile(
        optimizer=Adam(learning_rate=0.001),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )

    return model

# Build and test
bilstm = build_bidirectional_lstm(20, 5)
bilstm.summary()

Exercise 11.5: Implement Attention Mechanism (Open-ended)

Add a custom attention layer to an LSTM model to focus on important time steps.

Solution 11.5
def build_lstm_attention(sequence_length, n_features, lstm_units=64):
    inputs = layers.Input(shape=(sequence_length, n_features))

    # LSTM layer with sequence output
    lstm_out = layers.LSTM(lstm_units, return_sequences=True)(inputs)
    lstm_out = layers.Dropout(0.2)(lstm_out)

    # Apply attention
    attention_out = AttentionLayer()(lstm_out)

    # Dense layers
    x = layers.Dense(32, activation='relu')(attention_out)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(16, activation='relu')(x)

    # Output
    outputs = layers.Dense(1, activation='sigmoid')(x)

    model = Model(inputs, outputs)
    model.compile(
        optimizer=Adam(0.001),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )

    return model

# Build and test
attention_model = build_lstm_attention(20, 5)
attention_model.summary()

Exercise 11.6: Create Model Ensemble with Weighted Voting (Open-ended)

Build an ensemble of different deep learning architectures with learned weights.

Solution 11.6
class WeightedDeepEnsemble:
    def __init__(self):
        self.models = {}
        self.weights = {}
        self.scaler = StandardScaler()

    def build_models(self, input_dim, sequence_length=None, n_features=None):
        # Feedforward
        ff = Sequential([
            layers.Dense(64, input_dim=input_dim, activation='relu'),
            layers.BatchNormalization(),
            layers.Dropout(0.3),
            layers.Dense(32, activation='relu'),
            layers.Dense(1, activation='sigmoid')
        ])
        ff.compile(optimizer=Adam(0.001), loss='binary_crossentropy', metrics=['accuracy'])
        self.models['feedforward'] = ff

        # CNN-1D (if sequence data)
        if sequence_length and n_features:
            cnn = Sequential([
                layers.Conv1D(32, 3, activation='relu', 
                              input_shape=(sequence_length, n_features)),
                layers.MaxPooling1D(2),
                layers.Conv1D(64, 3, activation='relu'),
                layers.GlobalMaxPooling1D(),
                layers.Dense(32, activation='relu'),
                layers.Dense(1, activation='sigmoid')
            ])
            cnn.compile(optimizer=Adam(0.001), loss='binary_crossentropy', metrics=['accuracy'])
            self.models['cnn'] = cnn

    def train_and_weight(self, X_train, y_train, X_val, y_val, epochs=30):
        val_accuracies = {}

        for name, model in self.models.items():
            history = model.fit(
                X_train, y_train,
                validation_data=(X_val, y_val),
                epochs=epochs,
                batch_size=32,
                callbacks=[EarlyStopping(patience=5, restore_best_weights=True)],
                verbose=0
            )
            val_accuracies[name] = max(history.history['val_accuracy'])

        # Calculate weights based on validation accuracy
        total = sum(val_accuracies.values())
        self.weights = {name: acc/total for name, acc in val_accuracies.items()}
        print(f"Learned weights: {self.weights}")

    def predict(self, X):
        predictions = np.zeros((len(X), 1))
        for name, model in self.models.items():
            predictions += self.weights[name] * model.predict(X, verbose=0)
        return predictions

# Usage
ensemble = WeightedDeepEnsemble()
ensemble.build_models(input_dim=14)

Summary

In this module, you learned:

  1. Neural Network Fundamentals: Building feedforward networks with proper regularization for financial data

  2. LSTM Networks: Implementing sequence models that capture temporal dependencies in price data

  3. Transformers and Attention: Using attention mechanisms to identify important time steps

  4. Architecture Design: Creating multi-input models and residual connections

  5. Regularization Techniques: Preventing overfitting with dropout, batch normalization, and L1/L2 regularization

  6. Production Systems: Building complete trading systems with deep learning ensembles

Key Takeaways

  • Financial data requires heavy regularization due to low signal-to-noise ratio
  • LSTM and Transformers capture different types of temporal patterns
  • Ensemble methods combine strengths of multiple architectures
  • Proper data preprocessing (scaling, sequencing) is critical for deep learning
  • Always use validation sets and early stopping to prevent overfitting

Next Steps

In Module 12, you'll learn about backtesting ML strategies properly, including walk-forward optimization and avoiding common pitfalls like look-ahead bias.

Module 12: Backtesting ML Strategies

Overview

Backtesting ML trading strategies requires special care to avoid common pitfalls like look-ahead bias and overfitting. This module covers proper backtesting methodology, walk-forward optimization, and realistic performance evaluation.
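Look-ahead bias usually enters through a mis-shifted column. A minimal pandas illustration, assuming hypothetical column names: shifting the close by -1 is correct for the *target* (it is the thing being predicted), but using the same shifted series as a *feature* puts tomorrow's price into today's row, which a live model never sees:

```python
import pandas as pd

prices = pd.DataFrame({'close': [100.0, 101.0, 99.0, 102.0]})

# Correct: today's target is whether TOMORROW closes higher. The last row
# has no label yet and must be dropped before training.
prices['target'] = (prices['close'].shift(-1) > prices['close']).astype(int)

# Leaky FEATURE: tomorrow's close in today's row trivially "predicts"
# the target and inflates backtest accuracy to ~100%.
prices['leaky_feature'] = prices['close'].shift(-1)

print(prices)
```

Any feature whose value at row t could not have been computed from data available at time t is a leak, no matter how it was produced.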

Learning Objectives

By the end of this module, you will be able to:

  • Implement proper walk-forward validation for ML strategies
  • Identify and avoid common backtesting pitfalls
  • Build realistic backtesting frameworks with transaction costs
  • Apply robust performance evaluation techniques

Prerequisites

  • Module 7: Model Evaluation
  • Module 11: Deep Learning for Finance
  • Understanding of time series cross-validation

Estimated Time: 3 hours


Section 1: Walk-Forward Optimization

Walk-forward optimization is the gold standard for backtesting ML strategies, simulating real-world conditions where models are periodically retrained.
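The retraining loop can be expressed as a window generator: train on a fixed-length history, test on the next block, then roll both windows forward by one test block. A minimal sketch (window sizes are illustrative):

```python
def walk_forward_splits(n_samples, train_size, test_size):
    """Yield (train_indices, test_indices) pairs for rolling retraining."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = range(start, start + train_size)
        test_idx = range(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += test_size  # roll both windows forward by one test block

splits = list(walk_forward_splits(n_samples=100, train_size=60, test_size=10))
print(len(splits))  # 4 folds: the test blocks cover samples 60-99
```

Every test index sits strictly after its training window, and consecutive test blocks tile the out-of-sample period without overlap, so the concatenated test predictions form one continuous simulated track record.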

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
print("Libraries loaded for backtesting")
# Generate realistic financial data
def generate_backtest_data(n_samples=3000):
    """Generate synthetic data with regime changes."""
    np.random.seed(42)
    
    dates = pd.date_range(start='2015-01-01', periods=n_samples, freq='D')
    
    # Create regime-switching returns
    regime = np.zeros(n_samples)
    current_regime = 0
    
    for i in range(n_samples):
        if np.random.random() < 0.01:  # 1% chance to switch regime
            current_regime = 1 - current_regime
        regime[i] = current_regime
    
    # Generate returns based on regime
    returns = np.where(
        regime == 0,
        np.random.normal(0.0005, 0.012, n_samples),  # Bull regime
        np.random.normal(-0.0002, 0.018, n_samples)  # Bear regime
    )
    
    # Generate prices
    prices = 100 * np.exp(np.cumsum(returns))
    
    # Create OHLCV
    df = pd.DataFrame({
        'date': dates,
        'open': np.roll(prices, 1),
        'high': prices * (1 + np.abs(np.random.normal(0, 0.008, n_samples))),
        'low': prices * (1 - np.abs(np.random.normal(0, 0.008, n_samples))),
        'close': prices,
        'volume': np.random.lognormal(15, 0.5, n_samples),
        'regime': regime
    })
    df.loc[0, 'open'] = df.loc[0, 'close']
    df.set_index('date', inplace=True)
    
    return df

# Generate data
df = generate_backtest_data(3000)
print(f"Dataset: {df.index[0].date()} to {df.index[-1].date()}")
print(f"Samples: {len(df)}")
df.tail()
# Feature engineering
def create_features(df):
    """Create features for ML model."""
    data = df.copy()
    
    # Returns
    for period in [1, 5, 10, 20]:
        data[f'return_{period}d'] = data['close'].pct_change(period)
    
    # Volatility
    for period in [5, 10, 20]:
        data[f'volatility_{period}d'] = data['return_1d'].rolling(period).std()
    
    # Moving averages
    for period in [5, 10, 20, 50]:
        data[f'sma_{period}'] = data['close'].rolling(period).mean()
        data[f'price_to_sma_{period}'] = data['close'] / data[f'sma_{period}']
    
    # RSI
    delta = data['close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
    data['rsi'] = 100 - (100 / (1 + gain / (loss + 1e-10)))
    
    # MACD
    exp12 = data['close'].ewm(span=12).mean()
    exp26 = data['close'].ewm(span=26).mean()
    data['macd'] = exp12 - exp26
    data['macd_signal'] = data['macd'].ewm(span=9).mean()
    
    # Volume
    data['volume_sma'] = data['volume'].rolling(20).mean()
    data['volume_ratio'] = data['volume'] / data['volume_sma']
    
    # Target: next day direction
    data['target'] = (data['close'].shift(-1) > data['close']).astype(int)
    data['future_return'] = data['close'].pct_change().shift(-1)
    
    return data.dropna()

# Create features
df_features = create_features(df)
print(f"Features created: {len(df_features)} samples")
# Walk-Forward Optimizer
class WalkForwardOptimizer:
    """Walk-forward optimization framework for ML strategies."""
    
    def __init__(self, model, train_window=252, test_window=21, 
                 step_size=21, min_train_samples=100):
        """
        Args:
            model: sklearn-compatible model
            train_window: Number of days for training (1 year = 252)
            test_window: Number of days for testing (1 month = 21)
            step_size: How often to retrain (monthly = 21)
            min_train_samples: Minimum samples needed for training
        """
        self.model = model
        self.train_window = train_window
        self.test_window = test_window
        self.step_size = step_size
        self.min_train_samples = min_train_samples
        self.scaler = StandardScaler()
        self.results = []
        
    def run(self, df, feature_cols, target_col='target'):
        """Run walk-forward optimization."""
        X = df[feature_cols].values
        y = df[target_col].values
        dates = df.index
        
        n_samples = len(X)
        predictions = np.full(n_samples, np.nan)
        probabilities = np.full(n_samples, np.nan)
        
        # Walk-forward loop
        start_idx = self.train_window
        
        fold = 0
        while start_idx + self.test_window <= n_samples:
            # Define train and test indices
            train_start = max(0, start_idx - self.train_window)
            train_end = start_idx
            test_start = start_idx
            test_end = min(start_idx + self.test_window, n_samples)
            
            # Get train and test data
            X_train = X[train_start:train_end]
            y_train = y[train_start:train_end]
            X_test = X[test_start:test_end]
            y_test = y[test_start:test_end]
            
            # Skip if insufficient training data
            if len(X_train) < self.min_train_samples:
                start_idx += self.step_size
                continue
            
            # Scale features
            X_train_scaled = self.scaler.fit_transform(X_train)
            X_test_scaled = self.scaler.transform(X_test)
            
            # Train model
            self.model.fit(X_train_scaled, y_train)
            
            # Predict
            pred = self.model.predict(X_test_scaled)
            prob = self.model.predict_proba(X_test_scaled)[:, 1]
            
            # Store predictions
            predictions[test_start:test_end] = pred
            probabilities[test_start:test_end] = prob
            
            # Record fold results
            fold_accuracy = accuracy_score(y_test, pred)
            self.results.append({
                'fold': fold,
                'train_start': dates[train_start],
                'train_end': dates[train_end-1],
                'test_start': dates[test_start],
                'test_end': dates[test_end-1],
                'accuracy': fold_accuracy,
                'n_train': len(X_train),
                'n_test': len(X_test)
            })
            
            fold += 1
            start_idx += self.step_size
        
        # Create results dataframe
        df_results = df.copy()
        df_results['prediction'] = predictions
        df_results['probability'] = probabilities
        df_results['signal'] = np.where(predictions == 1, 1, -1)
        
        return df_results
    
    def get_fold_summary(self):
        """Get summary of all folds."""
        return pd.DataFrame(self.results)

print("WalkForwardOptimizer class defined")
# Run walk-forward optimization
feature_cols = ['return_1d', 'return_5d', 'return_10d', 'return_20d',
                'volatility_5d', 'volatility_10d', 'volatility_20d',
                'price_to_sma_5', 'price_to_sma_10', 'price_to_sma_20',
                'rsi', 'macd', 'volume_ratio']

# Initialize optimizer with Random Forest
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
wfo = WalkForwardOptimizer(
    model=model,
    train_window=252,  # 1 year training
    test_window=21,    # 1 month testing
    step_size=21       # Retrain monthly
)

# Run walk-forward
results = wfo.run(df_features, feature_cols)

# Get fold summary
fold_summary = wfo.get_fold_summary()
print(f"\nWalk-forward completed: {len(fold_summary)} folds")
print(f"Average accuracy: {fold_summary['accuracy'].mean():.4f}")
print(f"Accuracy std: {fold_summary['accuracy'].std():.4f}")
# Visualize walk-forward results
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Accuracy by fold
axes[0, 0].bar(fold_summary['fold'], fold_summary['accuracy'])
axes[0, 0].axhline(y=0.5, color='red', linestyle='--', label='Random')
axes[0, 0].axhline(y=fold_summary['accuracy'].mean(), color='green', 
                   linestyle='--', label='Average')
axes[0, 0].set_xlabel('Fold')
axes[0, 0].set_ylabel('Accuracy')
axes[0, 0].set_title('Accuracy by Fold')
axes[0, 0].legend()

# Rolling accuracy over time
test_mask = ~results['prediction'].isna()
rolling_acc = (results.loc[test_mask, 'prediction'] == 
               results.loc[test_mask, 'target']).rolling(63).mean()
axes[0, 1].plot(rolling_acc.index, rolling_acc.values)
axes[0, 1].axhline(y=0.5, color='red', linestyle='--')
axes[0, 1].set_xlabel('Date')
axes[0, 1].set_ylabel('Rolling Accuracy (63-day)')
axes[0, 1].set_title('Accuracy Over Time')

# Prediction probability distribution
valid_probs = results['probability'].dropna()
axes[1, 0].hist(valid_probs, bins=50, edgecolor='black', alpha=0.7)
axes[1, 0].axvline(x=0.5, color='red', linestyle='--')
axes[1, 0].set_xlabel('Prediction Probability')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Prediction Distribution')

# Training window visualization
axes[1, 1].scatter(fold_summary['fold'], fold_summary['n_train'], 
                   label='Train samples', alpha=0.7)
axes[1, 1].scatter(fold_summary['fold'], fold_summary['n_test'], 
                   label='Test samples', alpha=0.7)
axes[1, 1].set_xlabel('Fold')
axes[1, 1].set_ylabel('Number of Samples')
axes[1, 1].set_title('Sample Sizes by Fold')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

Section 2: Avoiding Backtesting Pitfalls

Common backtesting mistakes can make a losing strategy appear profitable.

# Demonstrating Look-Ahead Bias
def demonstrate_lookahead_bias(df, feature_cols):
    """Show the impact of look-ahead bias."""
    
    # WRONG: Using future information in features
    df_wrong = df.copy()
    # This uses the next day's close to create today's feature!
    df_wrong['future_leak'] = df_wrong['close'].shift(-1) / df_wrong['close'] - 1
    
    # Split data
    split_idx = int(len(df_wrong) * 0.7)
    
    # Train with leaked feature
    X_train = df_wrong[feature_cols + ['future_leak']].iloc[:split_idx].values
    X_test = df_wrong[feature_cols + ['future_leak']].iloc[split_idx:].values
    y_train = df_wrong['target'].iloc[:split_idx].values
    y_test = df_wrong['target'].iloc[split_idx:].values
    
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    # The leaked feature is NaN only on the final row (there is no next-day
    # close), and that row falls in the test set, so drop it from test only
    X_test_scaled = scaler.transform(X_test[:-1])
    y_test = y_test[:-1]
    
    model_biased = RandomForestClassifier(n_estimators=100, random_state=42)
    model_biased.fit(X_train_scaled, y_train)
    
    biased_accuracy = accuracy_score(y_test, model_biased.predict(X_test_scaled))
    
    # Train without leaked feature (correct approach)
    X_train_correct = df[feature_cols].iloc[:split_idx].values
    X_test_correct = df[feature_cols].iloc[split_idx:].values
    y_train_correct = df['target'].iloc[:split_idx].values
    y_test_correct = df['target'].iloc[split_idx:].values
    
    scaler2 = StandardScaler()
    X_train_scaled2 = scaler2.fit_transform(X_train_correct)
    X_test_scaled2 = scaler2.transform(X_test_correct)
    
    model_correct = RandomForestClassifier(n_estimators=100, random_state=42)
    model_correct.fit(X_train_scaled2, y_train_correct)
    
    correct_accuracy = accuracy_score(y_test_correct, model_correct.predict(X_test_scaled2))
    
    return biased_accuracy, correct_accuracy

biased_acc, correct_acc = demonstrate_lookahead_bias(df_features, feature_cols)
print("Look-Ahead Bias Demonstration:")
print(f"Accuracy WITH look-ahead bias: {biased_acc:.4f} (unrealistically high!)")
print(f"Accuracy WITHOUT look-ahead bias: {correct_acc:.4f} (realistic)")
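Leaks are rarely this obvious, so a cheap sanity check is to scan every feature's correlation with the future return before training: a near-perfect correlation almost always means leakage. A minimal sketch (the helper name, the 0.5 threshold, and the toy data are illustrative assumptions):

```python
import numpy as np
import pandas as pd

def flag_leaky_features(data, feature_cols, future_return_col, threshold=0.5):
    """Flag features whose correlation with the future return is implausibly high."""
    corrs = data[feature_cols + [future_return_col]].corr()[future_return_col]
    corrs = corrs.drop(future_return_col)
    return corrs[corrs.abs() > threshold]

# Toy illustration: 'leak' is literally the future return; 'noise' is unrelated
rng = np.random.default_rng(0)
fut = rng.normal(0, 0.01, 500)
demo = pd.DataFrame({'leak': fut,
                     'noise': rng.normal(0, 1, 500),
                     'future_return': fut})

flagged = flag_leaky_features(demo, ['leak', 'noise'], 'future_return')
print(flagged)  # only 'leak' should be reported
```

A legitimate daily feature rarely correlates with the next day's return above a few percent, so anything near the threshold deserves a manual audit of how the feature was constructed.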
# Survivorship Bias Simulator
def simulate_survivorship_bias(n_stocks=100, n_periods=252, survival_rate=0.8):
    """Simulate the impact of survivorship bias."""
    np.random.seed(42)
    
    # Generate returns for all stocks
    all_returns = np.random.normal(0.0003, 0.02, (n_stocks, n_periods))
    
    # Some stocks will "die" (go bankrupt)
    # Dying stocks tend to have worse returns before death
    n_deaths = int(n_stocks * (1 - survival_rate))
    death_indices = np.random.choice(n_stocks, n_deaths, replace=False)
    
    # Make dying stocks have negative returns
    for idx in death_indices:
        death_period = np.random.randint(n_periods // 2, n_periods)
        all_returns[idx, :death_period] = np.random.normal(-0.002, 0.03, death_period)
        all_returns[idx, death_period:] = np.nan  # Dead after this
    
    # Calculate true average (including dead stocks)
    true_avg_return = np.nanmean(all_returns)
    
    # Calculate survivorship-biased average (only surviving stocks)
    survivor_mask = ~np.isnan(all_returns[:, -1])  # Stocks that survived to end
    biased_avg_return = np.mean(all_returns[survivor_mask])
    
    return true_avg_return, biased_avg_return, death_indices

true_ret, biased_ret, deaths = simulate_survivorship_bias()
print("\nSurvivorship Bias Simulation:")
print(f"True average daily return: {true_ret:.4%}")
print(f"Biased average daily return: {biased_ret:.4%}")
print(f"Annualized difference: {(biased_ret - true_ret) * 252:.2%}")
# Overfitting Detection
def detect_overfitting(df, feature_cols, max_depth_range=range(2, 20)):
    """Detect overfitting by comparing train/test performance."""
    
    X = df[feature_cols].values
    y = df['target'].values
    
    # Time-based split
    split_idx = int(len(X) * 0.7)
    X_train, X_test = X[:split_idx], X[split_idx:]
    y_train, y_test = y[:split_idx], y[split_idx:]
    
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    results = []
    
    for depth in max_depth_range:
        model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
        model.fit(X_train_scaled, y_train)
        
        train_acc = accuracy_score(y_train, model.predict(X_train_scaled))
        test_acc = accuracy_score(y_test, model.predict(X_test_scaled))
        
        results.append({
            'max_depth': depth,
            'train_accuracy': train_acc,
            'test_accuracy': test_acc,
            'overfit_gap': train_acc - test_acc
        })
    
    return pd.DataFrame(results)

overfit_results = detect_overfitting(df_features, feature_cols)
print("\nOverfitting Analysis:")
print(overfit_results.to_string(index=False))
# Visualize overfitting
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Train vs Test accuracy
axes[0].plot(overfit_results['max_depth'], overfit_results['train_accuracy'], 
             'b-', label='Training', marker='o')
axes[0].plot(overfit_results['max_depth'], overfit_results['test_accuracy'], 
             'r-', label='Test', marker='o')
axes[0].set_xlabel('Max Depth')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Model Complexity vs Performance')
axes[0].legend()

# Overfit gap
axes[1].bar(overfit_results['max_depth'], overfit_results['overfit_gap'])
axes[1].axhline(y=0.05, color='red', linestyle='--', label='Warning threshold')
axes[1].set_xlabel('Max Depth')
axes[1].set_ylabel('Train - Test Gap')
axes[1].set_title('Overfitting Gap')
axes[1].legend()

plt.tight_layout()
plt.show()

# Recommend optimal depth
best_depth = overfit_results.loc[overfit_results['test_accuracy'].idxmax(), 'max_depth']
print(f"\nRecommended max_depth: {best_depth}")
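One subtler pitfall worth guarding against: because the target at day t is built from the close at t+1, the last training label overlaps the first test observation. Inserting a short embargo gap between the train and test blocks removes this overlap. A minimal sketch of the idea (the `embargoed_split` helper and its parameters are assumptions, not part of the framework above):

```python
def embargoed_split(n_samples, train_window, test_window, embargo=1):
    """Yield (train_idx, test_idx) pairs with an embargo gap after training."""
    start = train_window
    while start + embargo + test_window <= n_samples:
        train_idx = list(range(start - train_window, start))
        test_idx = list(range(start + embargo, start + embargo + test_window))
        yield train_idx, test_idx
        start += test_window

splits = list(embargoed_split(600, train_window=252, test_window=21, embargo=1))
train_idx, test_idx = splits[0]
# With embargo=1, one observation separates the last train label from the test set
print(len(splits), test_idx[0] - train_idx[-1])
```

For a one-day-ahead target an embargo of one day suffices; targets built from longer horizons (e.g. 5-day returns) need a correspondingly longer gap.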

Section 3: Realistic Backtesting Framework

A proper backtest must account for transaction costs, slippage, and realistic execution.

# Realistic Backtester
class RealisticBacktester:
    """Backtesting framework with realistic assumptions."""
    
    def __init__(self, initial_capital=100000, commission=0.001, 
                 slippage=0.0005, max_position_size=0.1):
        """
        Args:
            initial_capital: Starting capital
            commission: Commission per trade (0.1% = 0.001)
            slippage: Expected slippage (0.05% = 0.0005)
            max_position_size: Maximum position as fraction of portfolio
        """
        self.initial_capital = initial_capital
        self.commission = commission
        self.slippage = slippage
        self.max_position_size = max_position_size
        
    def run_backtest(self, df, signal_col='signal', return_col='future_return'):
        """Run backtest with realistic assumptions."""
        results = df.copy()
        
        # Remove NaN signals
        mask = ~results[signal_col].isna() & ~results[return_col].isna()
        results = results[mask].copy()
        
        # Initialize tracking
        capital = self.initial_capital
        position = 0  # 1 = long, -1 = short, 0 = flat
        
        capitals = []
        positions = []
        trades = []
        costs = []
        
        for idx, row in results.iterrows():
            signal = row[signal_col]
            ret = row[return_col]
            
            # Calculate position change
            if signal != position:
                # abs(signal - position) is 1 for an entry/exit and 2 for a
                # full reversal (close the old side, then open the new one)
                trade_cost = abs(signal - position) * capital * (self.commission + self.slippage)
                capital -= trade_cost
                trades.append(1)
                costs.append(trade_cost)
            else:
                trades.append(0)
                costs.append(0)
            
            # Update position
            position = signal
            
            # Apply position sizing
            effective_position = position * self.max_position_size
            
            # Calculate return
            capital = capital * (1 + effective_position * ret)
            
            capitals.append(capital)
            positions.append(position)
        
        results['capital'] = capitals
        results['position'] = positions
        results['trade'] = trades
        results['cost'] = costs
        
        return results
    
    def calculate_metrics(self, results):
        """Calculate performance metrics."""
        capitals = results['capital'].values
        
        # Returns
        returns = np.diff(capitals) / capitals[:-1]
        
        # Total return
        total_return = (capitals[-1] / capitals[0]) - 1
        
        # Annualized return
        n_years = len(capitals) / 252
        annual_return = (1 + total_return) ** (1/n_years) - 1
        
        # Sharpe ratio
        sharpe = np.sqrt(252) * np.mean(returns) / (np.std(returns) + 1e-8)
        
        # Max drawdown
        peak = np.maximum.accumulate(capitals)
        drawdown = (capitals - peak) / peak
        max_drawdown = np.min(drawdown)
        
        # Win rate
        win_rate = np.mean(returns > 0)
        
        # Trade statistics
        n_trades = results['trade'].sum()
        total_costs = results['cost'].sum()
        
        return {
            'total_return': total_return,
            'annual_return': annual_return,
            'sharpe_ratio': sharpe,
            'max_drawdown': max_drawdown,
            'win_rate': win_rate,
            'n_trades': n_trades,
            'total_costs': total_costs,
            'cost_drag': total_costs / self.initial_capital
        }

print("RealisticBacktester class defined")
# Run realistic backtest
backtester = RealisticBacktester(
    initial_capital=100000,
    commission=0.001,  # 0.1%
    slippage=0.0005,   # 0.05%
    max_position_size=1.0  # Full position
)

# Use walk-forward results
backtest_results = backtester.run_backtest(results, signal_col='signal', return_col='future_return')

# Calculate metrics
metrics = backtester.calculate_metrics(backtest_results)

print("\n" + "="*50)
print("Realistic Backtest Results")
print("="*50)
for key, value in metrics.items():
    if 'return' in key or 'drawdown' in key or 'rate' in key or 'drag' in key:
        print(f"{key}: {value:.2%}")
    elif 'ratio' in key:
        print(f"{key}: {value:.2f}")
    else:
        print(f"{key}: {value:,.0f}")
# Compare with and without costs
backtester_no_costs = RealisticBacktester(
    initial_capital=100000,
    commission=0,
    slippage=0,
    max_position_size=1.0
)

results_no_costs = backtester_no_costs.run_backtest(results)
metrics_no_costs = backtester_no_costs.calculate_metrics(results_no_costs)

print("\nImpact of Transaction Costs:")
print(f"Return without costs: {metrics_no_costs['total_return']:.2%}")
print(f"Return with costs: {metrics['total_return']:.2%}")
print(f"Cost impact: {metrics_no_costs['total_return'] - metrics['total_return']:.2%}")
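A useful back-of-the-envelope check is the break-even accuracy: the directional hit rate a strategy needs just to cover its costs. Under the simplifying assumption of symmetric move sizes, it follows directly from the per-trade cost and the average absolute daily move (the numbers below are illustrative):

```python
def breakeven_accuracy(avg_abs_move, cost_per_trade):
    """Directional accuracy needed just to cover trading costs.

    With symmetric move sizes, the expected gross edge per trade is
    (2p - 1) * avg_abs_move; setting it equal to the per-trade cost
    and solving for p gives the break-even hit rate.
    """
    return 0.5 + cost_per_trade / (2 * avg_abs_move)

# Illustrative numbers: ~1.2% average daily move, 0.15% cost per trade
# (matching the 0.1% commission + 0.05% slippage used above)
p_star = breakeven_accuracy(0.012, 0.0015)
print(f"Break-even accuracy: {p_star:.2%}")  # 56.25%
```

This is why an accuracy barely above 50% can still lose money: the model must beat not just a coin flip, but a coin flip plus the cost hurdle.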
# Visualize backtest
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Equity curve
axes[0, 0].plot(backtest_results.index, backtest_results['capital'], 
                label='Strategy', linewidth=2)

# Buy and hold comparison
bh_capital = 100000 * (1 + backtest_results['future_return']).cumprod()
axes[0, 0].plot(backtest_results.index, bh_capital, 
                label='Buy & Hold', alpha=0.7, linewidth=2)
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Portfolio Value ($)')
axes[0, 0].set_title('Equity Curve')
axes[0, 0].legend()

# Drawdown
peak = backtest_results['capital'].cummax()
drawdown = (backtest_results['capital'] - peak) / peak
axes[0, 1].fill_between(drawdown.index, drawdown.values, 0, alpha=0.7)
axes[0, 1].set_xlabel('Date')
axes[0, 1].set_ylabel('Drawdown')
axes[0, 1].set_title('Drawdown Over Time')

# Position over time
axes[1, 0].plot(backtest_results.index, backtest_results['position'], 
                linewidth=0.5)
axes[1, 0].set_xlabel('Date')
axes[1, 0].set_ylabel('Position')
axes[1, 0].set_title('Position Over Time')

# Cumulative costs
cum_costs = backtest_results['cost'].cumsum()
axes[1, 1].plot(backtest_results.index, cum_costs)
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Cumulative Costs ($)')
axes[1, 1].set_title('Transaction Costs')

plt.tight_layout()
plt.show()

Section 4: Robustness Testing

Testing strategy robustness helps ensure performance isn't due to luck or overfitting.

# Monte Carlo Simulation for Strategy Robustness
def monte_carlo_robustness(returns, n_simulations=1000):
    """Test strategy robustness with a sign-randomization test.

    Note: merely shuffling the order of returns leaves the mean, standard
    deviation, and compounded total return unchanged, so every "simulation"
    would reproduce the original statistics exactly. Randomly flipping the
    sign of each return instead simulates a strategy with the same trade
    magnitudes but no directional skill, giving a genuine null distribution.
    """
    np.random.seed(42)
    
    original_sharpe = np.sqrt(252) * returns.mean() / (returns.std() + 1e-8)
    original_total = (1 + returns).prod() - 1
    
    simulated_sharpes = []
    simulated_returns = []
    
    for _ in range(n_simulations):
        # Random sign flips: the null hypothesis of zero directional edge
        flipped = returns * np.random.choice([-1, 1], size=len(returns))
        
        sim_sharpe = np.sqrt(252) * flipped.mean() / (flipped.std() + 1e-8)
        sim_total = (1 + flipped).prod() - 1
        
        simulated_sharpes.append(sim_sharpe)
        simulated_returns.append(sim_total)
    
    # Percentile of the real strategy within the no-skill distribution
    sharpe_percentile = (np.array(simulated_sharpes) < original_sharpe).mean() * 100
    return_percentile = (np.array(simulated_returns) < original_total).mean() * 100
    
    return {
        'original_sharpe': original_sharpe,
        'simulated_sharpes': simulated_sharpes,
        'sharpe_percentile': sharpe_percentile,
        'original_return': original_total,
        'simulated_returns': simulated_returns,
        'return_percentile': return_percentile
    }

# Run Monte Carlo
strategy_returns = backtest_results['capital'].pct_change().dropna()
mc_results = monte_carlo_robustness(strategy_returns.values)

print("\nMonte Carlo Robustness Test:")
print(f"Strategy Sharpe: {mc_results['original_sharpe']:.2f}")
print(f"Sharpe percentile: {mc_results['sharpe_percentile']:.1f}%")
print(f"Strategy Return: {mc_results['original_return']:.2%}")
print(f"Return percentile: {mc_results['return_percentile']:.1f}%")
# Visualize Monte Carlo results
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Sharpe distribution
axes[0].hist(mc_results['simulated_sharpes'], bins=50, alpha=0.7, edgecolor='black')
axes[0].axvline(x=mc_results['original_sharpe'], color='red', linewidth=2,
                label=f'Strategy: {mc_results["original_sharpe"]:.2f}')
axes[0].set_xlabel('Sharpe Ratio')
axes[0].set_ylabel('Frequency')
axes[0].set_title(f'Monte Carlo Sharpe Distribution\n(Percentile: {mc_results["sharpe_percentile"]:.1f}%)')
axes[0].legend()

# Return distribution
axes[1].hist(mc_results['simulated_returns'], bins=50, alpha=0.7, edgecolor='black')
axes[1].axvline(x=mc_results['original_return'], color='red', linewidth=2,
                label=f'Strategy: {mc_results["original_return"]:.2%}')
axes[1].set_xlabel('Total Return')
axes[1].set_ylabel('Frequency')
axes[1].set_title(f'Monte Carlo Return Distribution\n(Percentile: {mc_results["return_percentile"]:.1f}%)')
axes[1].legend()

plt.tight_layout()
plt.show()
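One caveat: randomization tests that treat daily returns as independent destroy any serial dependence in the series. A moving-block bootstrap resamples contiguous chunks instead, so autocorrelation within each block survives. A minimal sketch (the helper name and the 21-day block size, roughly one trading month, are assumptions):

```python
import numpy as np

def block_bootstrap(returns, block_size=21, rng=None):
    """Resample contiguous blocks so short-range autocorrelation survives."""
    rng = rng if rng is not None else np.random.default_rng(42)
    n = len(returns)
    out = []
    while len(out) < n:
        start = int(rng.integers(0, n - block_size + 1))
        out.extend(returns[start:start + block_size])
    return np.asarray(out[:n])

# Three years of synthetic daily returns, resampled in one-month blocks
rets = np.random.default_rng(0).normal(0.0005, 0.01, 756)
sample = block_bootstrap(rets)
print(sample.shape)  # (756,)
```

Repeating the robustness statistics over many block-bootstrap samples gives confidence intervals for the Sharpe ratio that respect volatility clustering, at the cost of choosing a block length.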
# Parameter Sensitivity Analysis
def sensitivity_analysis(df, feature_cols, param_name, param_values):
    """Analyze sensitivity to a parameter."""
    results = []
    
    X = df[feature_cols].values
    y = df['target'].values
    
    split_idx = int(len(X) * 0.7)
    X_train, X_test = X[:split_idx], X[split_idx:]
    y_train, y_test = y[:split_idx], y[split_idx:]
    
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    for value in param_values:
        # Build the full parameter dict in one place so that sweeping
        # n_estimators does not pass the same keyword argument twice
        params = {'n_estimators': 100, 'random_state': 42, param_name: value}
        model = RandomForestClassifier(**params)
        model.fit(X_train_scaled, y_train)
        
        test_acc = accuracy_score(y_test, model.predict(X_test_scaled))
        results.append({'param_value': value, 'test_accuracy': test_acc})
    
    return pd.DataFrame(results)

# Test sensitivity to max_depth
depth_sensitivity = sensitivity_analysis(
    df_features, feature_cols, 
    'max_depth', range(2, 15)
)

# Test sensitivity to n_estimators
n_est_sensitivity = sensitivity_analysis(
    df_features, feature_cols,
    'n_estimators', [10, 25, 50, 100, 200, 300]
)

print("Parameter Sensitivity:")
print("\nMax Depth:")
print(depth_sensitivity.to_string(index=False))
# Visualize sensitivity
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].plot(depth_sensitivity['param_value'], depth_sensitivity['test_accuracy'], 
             marker='o', linewidth=2)
axes[0].set_xlabel('Max Depth')
axes[0].set_ylabel('Test Accuracy')
axes[0].set_title('Sensitivity to Max Depth')

axes[1].plot(n_est_sensitivity['param_value'], n_est_sensitivity['test_accuracy'],
             marker='o', linewidth=2)
axes[1].set_xlabel('Number of Estimators')
axes[1].set_ylabel('Test Accuracy')
axes[1].set_title('Sensitivity to N Estimators')

plt.tight_layout()
plt.show()

Section 5: Module Project - Complete Backtesting System

Build a complete backtesting system with all proper safeguards.

# Complete ML Backtesting System
class MLBacktestingSystem:
    """Complete ML strategy backtesting system."""
    
    def __init__(self, model, initial_capital=100000,
                 commission=0.001, slippage=0.0005):
        self.model = model
        self.initial_capital = initial_capital
        self.commission = commission
        self.slippage = slippage
        self.scaler = StandardScaler()
        self.walk_forward_results = None
        self.backtest_results = None
        
    def run_walk_forward(self, df, feature_cols, target_col='target',
                          train_window=252, test_window=21, step_size=21):
        """Run walk-forward optimization."""
        X = df[feature_cols].values
        y = df[target_col].values
        dates = df.index
        
        n_samples = len(X)
        predictions = np.full(n_samples, np.nan)
        probabilities = np.full(n_samples, np.nan)
        
        start_idx = train_window
        fold_results = []
        
        while start_idx + test_window <= n_samples:
            train_start = max(0, start_idx - train_window)
            train_end = start_idx
            test_start = start_idx
            test_end = min(start_idx + test_window, n_samples)
            
            X_train = X[train_start:train_end]
            y_train = y[train_start:train_end]
            X_test = X[test_start:test_end]
            y_test = y[test_start:test_end]
            
            X_train_scaled = self.scaler.fit_transform(X_train)
            X_test_scaled = self.scaler.transform(X_test)
            
            self.model.fit(X_train_scaled, y_train)
            
            pred = self.model.predict(X_test_scaled)
            prob = self.model.predict_proba(X_test_scaled)[:, 1]
            
            predictions[test_start:test_end] = pred
            probabilities[test_start:test_end] = prob
            
            fold_results.append({
                'test_start': dates[test_start],
                'test_end': dates[test_end-1],
                'accuracy': accuracy_score(y_test, pred)
            })
            
            start_idx += step_size
        
        results = df.copy()
        results['prediction'] = predictions
        results['probability'] = probabilities
        # Leave the signal undefined where no prediction exists
        results['signal'] = np.where(np.isnan(predictions), np.nan,
                                     np.where(predictions == 1, 1, -1))
        
        self.walk_forward_results = results
        self.fold_summary = pd.DataFrame(fold_results)
        
        return results
    
    def run_backtest(self, signal_col='signal', return_col='future_return'):
        """Run realistic backtest."""
        if self.walk_forward_results is None:
            raise ValueError("Run walk_forward first")
        
        df = self.walk_forward_results.copy()
        mask = ~df[signal_col].isna() & ~df[return_col].isna()
        df = df[mask].copy()
        
        capital = self.initial_capital
        position = 0
        
        capitals = []
        trades = []
        
        for _, row in df.iterrows():
            signal = row[signal_col]
            ret = row[return_col]
            
            if signal != position:
                cost = abs(signal - position) * capital * (self.commission + self.slippage)
                capital -= cost
                trades.append(1)
            else:
                trades.append(0)
            
            position = signal
            capital = capital * (1 + position * ret)
            capitals.append(capital)
        
        df['capital'] = capitals
        df['trade'] = trades
        
        self.backtest_results = df
        return df
    
    def calculate_metrics(self):
        """Calculate all performance metrics."""
        if self.backtest_results is None:
            raise ValueError("Run backtest first")
        
        capitals = self.backtest_results['capital'].values
        returns = np.diff(capitals) / capitals[:-1]
        
        total_return = (capitals[-1] / capitals[0]) - 1
        n_years = len(capitals) / 252
        annual_return = (1 + total_return) ** (1/n_years) - 1
        sharpe = np.sqrt(252) * np.mean(returns) / (np.std(returns) + 1e-8)
        
        peak = np.maximum.accumulate(capitals)
        drawdown = (capitals - peak) / peak
        max_drawdown = np.min(drawdown)
        
        # Walk-forward metrics
        avg_fold_accuracy = self.fold_summary['accuracy'].mean()
        accuracy_std = self.fold_summary['accuracy'].std()
        
        return {
            'total_return': total_return,
            'annual_return': annual_return,
            'sharpe_ratio': sharpe,
            'max_drawdown': max_drawdown,
            'avg_fold_accuracy': avg_fold_accuracy,
            'accuracy_std': accuracy_std,
            'n_trades': self.backtest_results['trade'].sum(),
            'n_folds': len(self.fold_summary)
        }
    
    def run_robustness_tests(self, n_simulations=500):
        """Run a sign-randomization robustness test.

        A plain shuffle would leave the Sharpe ratio unchanged (mean and
        standard deviation are order-invariant), so return signs are flipped
        at random to build a no-skill null distribution instead."""
        np.random.seed(42)
        returns = self.backtest_results['capital'].pct_change().dropna().values
        
        original_sharpe = np.sqrt(252) * returns.mean() / (returns.std() + 1e-8)
        
        simulated_sharpes = []
        for _ in range(n_simulations):
            flipped = returns * np.random.choice([-1, 1], size=len(returns))
            sim_sharpe = np.sqrt(252) * flipped.mean() / (flipped.std() + 1e-8)
            simulated_sharpes.append(sim_sharpe)
        
        percentile = (np.array(simulated_sharpes) < original_sharpe).mean() * 100
        
        return {
            'original_sharpe': original_sharpe,
            'sharpe_percentile': percentile,
            'is_significant': percentile > 95
        }
    
    def generate_report(self):
        """Generate complete backtest report."""
        metrics = self.calculate_metrics()
        robustness = self.run_robustness_tests()
        
        print("\n" + "="*60)
        print("ML STRATEGY BACKTEST REPORT")
        print("="*60)
        
        print("\n--- Performance Metrics ---")
        print(f"Total Return: {metrics['total_return']:.2%}")
        print(f"Annual Return: {metrics['annual_return']:.2%}")
        print(f"Sharpe Ratio: {metrics['sharpe_ratio']:.2f}")
        print(f"Max Drawdown: {metrics['max_drawdown']:.2%}")
        
        print("\n--- Walk-Forward Results ---")
        print(f"Number of Folds: {metrics['n_folds']}")
        print(f"Average Fold Accuracy: {metrics['avg_fold_accuracy']:.4f}")
        print(f"Accuracy Std Dev: {metrics['accuracy_std']:.4f}")
        
        print("\n--- Trading Statistics ---")
        print(f"Number of Trades: {metrics['n_trades']}")
        
        print("\n--- Robustness Tests ---")
        print(f"Strategy Sharpe: {robustness['original_sharpe']:.2f}")
        print(f"Sharpe Percentile: {robustness['sharpe_percentile']:.1f}%")
        print(f"Statistically Significant: {robustness['is_significant']}")
        
        return metrics, robustness

print("MLBacktestingSystem class defined")
# Run complete backtesting system
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

system = MLBacktestingSystem(
    model=model,
    initial_capital=100000,
    commission=0.001,
    slippage=0.0005
)

# Run walk-forward
wf_results = system.run_walk_forward(
    df_features, feature_cols,
    train_window=252,
    test_window=21,
    step_size=21
)

# Run backtest
bt_results = system.run_backtest()

# Generate report
metrics, robustness = system.generate_report()
# Comprehensive visualization
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# Equity curve
axes[0, 0].plot(bt_results.index, bt_results['capital'], linewidth=2)
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Portfolio Value ($)')
axes[0, 0].set_title('Equity Curve')

# Drawdown
peak = bt_results['capital'].cummax()
dd = (bt_results['capital'] - peak) / peak
axes[0, 1].fill_between(dd.index, dd.values, 0, alpha=0.7, color='red')
axes[0, 1].set_xlabel('Date')
axes[0, 1].set_ylabel('Drawdown')
axes[0, 1].set_title('Drawdown')

# Walk-forward accuracy
axes[0, 2].bar(range(len(system.fold_summary)), system.fold_summary['accuracy'])
axes[0, 2].axhline(y=0.5, color='red', linestyle='--')
axes[0, 2].axhline(y=system.fold_summary['accuracy'].mean(), color='green', linestyle='--')
axes[0, 2].set_xlabel('Fold')
axes[0, 2].set_ylabel('Accuracy')
axes[0, 2].set_title('Walk-Forward Accuracy')

# Monthly returns
monthly_returns = bt_results['capital'].resample('M').last().pct_change().dropna()
colors = ['green' if r > 0 else 'red' for r in monthly_returns]
axes[1, 0].bar(range(len(monthly_returns)), monthly_returns.values, color=colors)
axes[1, 0].set_xlabel('Month')
axes[1, 0].set_ylabel('Return')
axes[1, 0].set_title('Monthly Returns')

# Return distribution
daily_returns = bt_results['capital'].pct_change().dropna()
axes[1, 1].hist(daily_returns, bins=50, edgecolor='black', alpha=0.7)
axes[1, 1].axvline(x=0, color='red', linestyle='--')
axes[1, 1].set_xlabel('Daily Return')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Return Distribution')

# Rolling Sharpe
rolling_sharpe = np.sqrt(252) * daily_returns.rolling(63).mean() / daily_returns.rolling(63).std()
axes[1, 2].plot(rolling_sharpe.index, rolling_sharpe.values)
axes[1, 2].axhline(y=0, color='red', linestyle='--')
axes[1, 2].set_xlabel('Date')
axes[1, 2].set_ylabel('Rolling Sharpe (63-day)')
axes[1, 2].set_title('Rolling Sharpe Ratio')

plt.tight_layout()
plt.show()

Exercises

Complete the following exercises to practice ML backtesting.

Exercise 12.1: Implement Walk-Forward Split (Guided)

Create a function that generates walk-forward train/test indices.

Solution 12.1
def walk_forward_split(n_samples, train_size, test_size, step_size):
    """Generate (train_indices, test_indices) pairs for walk-forward validation."""
    splits = []

    # Start from end of first training window
    start_idx = train_size

    while start_idx + test_size <= n_samples:
        # Calculate train indices
        train_start = max(0, start_idx - train_size)
        train_end = start_idx
        train_indices = list(range(train_start, train_end))

        # Calculate test indices
        test_start = start_idx
        test_end = min(start_idx + test_size, n_samples)
        test_indices = list(range(test_start, test_end))

        splits.append((train_indices, test_indices))

        # Move to next fold
        start_idx += step_size

    return splits
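A quick standalone check of the split arithmetic (the sizes here are illustrative, not from the course data): with 100 samples, a 50-day train window, and a 10-day test window stepped by 10 days, the test windows tile the remaining sample and training always ends exactly where testing begins.

```python
# Illustrative sizes: 100 samples, 50-day train, 10-day test, 10-day step
n_samples, train_size, test_size, step_size = 100, 50, 10, 10

folds = []
start = train_size
while start + test_size <= n_samples:
    # Each fold is ((train_start, train_end), (test_start, test_end))
    folds.append(((start - train_size, start), (start, start + test_size)))
    start += step_size

print(f"{len(folds)} folds")
print(folds[0])    # ((0, 50), (50, 60))
print(folds[-1])   # ((40, 90), (90, 100))
```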

Exercise 12.2: Calculate Strategy Metrics (Guided)

Implement a function to calculate key performance metrics.

Solution 12.2
def calculate_strategy_metrics(returns):
    """Calculate total return, annualized Sharpe, max drawdown, and win rate."""
    returns = np.array(returns)

    # Calculate total return
    total_return = (1 + returns).prod() - 1

    # Calculate Sharpe ratio (annualized)
    sharpe = np.sqrt(252) * returns.mean() / (returns.std() + 1e-8)

    # Calculate max drawdown
    cumulative = (1 + returns).cumprod()
    peak = np.maximum.accumulate(cumulative)
    drawdown = (cumulative - peak) / peak
    max_dd = drawdown.min()

    # Calculate win rate
    win_rate = (returns > 0).mean()

    return {
        'total_return': total_return,
        'sharpe_ratio': sharpe,
        'max_drawdown': max_dd,
        'win_rate': win_rate
    }
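The drawdown arithmetic in the solution is easy to verify by hand on a tiny series (numbers chosen purely for illustration):

```python
import numpy as np

# Hand-checkable series: +10%, -20%, +5%
rets = np.array([0.10, -0.20, 0.05])
cumulative = np.cumprod(1 + rets)          # [1.10, 0.88, 0.924]
peak = np.maximum.accumulate(cumulative)   # [1.10, 1.10, 1.10]
drawdown = (cumulative - peak) / peak
print(f"Max drawdown: {drawdown.min():.2%}")  # -20.00%, hit right after the peak
```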

Exercise 12.3: Implement Transaction Cost Calculator (Guided)

Create a function that calculates transaction costs for a signal series.

Solution 12.3
def calculate_transaction_costs(signals, prices, commission=0.001, slippage=0.0005):
    """Calculate commission and slippage costs implied by a position signal series."""
    signals = np.array(signals)
    prices = np.array(prices)

    # Calculate position changes
    position_changes = np.abs(np.diff(signals))
    position_changes = np.insert(position_changes, 0, abs(signals[0]))

    # Calculate trade values
    trade_values = position_changes * prices

    # Calculate costs
    costs = trade_values * (commission + slippage)

    return {
        'total_cost': costs.sum(),
        'n_trades': (position_changes > 0).sum(),
        'costs_per_trade': costs
    }
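The position-change logic is worth tracing on a tiny signal series (a hypothetical flat/long sequence): every change in position, including the initial entry, counts as a trade.

```python
import numpy as np

# Hypothetical signal series: flat -> long -> hold -> flat -> long
signals = np.array([0, 1, 1, 0, 1])
position_changes = np.abs(np.diff(signals))
position_changes = np.insert(position_changes, 0, abs(signals[0]))
print(position_changes)                   # [0 1 0 1 1]
print(int((position_changes > 0).sum()))  # 3 trades: entry, exit, re-entry
```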

Exercise 12.4: Build Expanding Window Backtester (Open-ended)

Create a backtester that uses expanding training windows instead of fixed rolling windows.

Solution 12.4
class ExpandingWindowBacktester:
    def __init__(self, model, min_train_samples=252, test_window=21):
        self.model = model
        self.min_train_samples = min_train_samples
        self.test_window = test_window
        self.scaler = StandardScaler()
        self.results = []

    def run(self, X, y):
        n_samples = len(X)
        predictions = np.full(n_samples, np.nan)

        start_idx = self.min_train_samples

        while start_idx + self.test_window <= n_samples:
            # Expanding window: use ALL data from beginning
            train_start = 0  # Always start from beginning
            train_end = start_idx
            test_start = start_idx
            test_end = start_idx + self.test_window

            X_train = X[train_start:train_end]
            y_train = y[train_start:train_end]
            X_test = X[test_start:test_end]
            y_test = y[test_start:test_end]

            X_train_scaled = self.scaler.fit_transform(X_train)
            X_test_scaled = self.scaler.transform(X_test)

            self.model.fit(X_train_scaled, y_train)
            pred = self.model.predict(X_test_scaled)

            predictions[test_start:test_end] = pred

            self.results.append({
                'train_size': len(X_train),
                'accuracy': accuracy_score(y_test, pred)
            })

            start_idx += self.test_window

        return predictions, pd.DataFrame(self.results)

# Compare with rolling
expanding = ExpandingWindowBacktester(
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
    min_train_samples=252
)
exp_pred, exp_results = expanding.run(X, y)
print(f"Expanding window avg accuracy: {exp_results['accuracy'].mean():.4f}")

Exercise 12.5: Implement Bootstrap Confidence Intervals (Open-ended)

Create a function that calculates bootstrap confidence intervals for strategy metrics.

Solution 12.5
def bootstrap_confidence_intervals(returns, n_bootstrap=1000, confidence_level=0.95):
    """Calculate bootstrap confidence intervals for strategy metrics."""
    np.random.seed(42)
    returns = np.array(returns)
    n_samples = len(returns)

    bootstrap_sharpes = []
    bootstrap_returns = []

    for _ in range(n_bootstrap):
        # Resample with replacement
        sample_indices = np.random.choice(n_samples, size=n_samples, replace=True)
        sample_returns = returns[sample_indices]

        # Calculate metrics
        sharpe = np.sqrt(252) * sample_returns.mean() / (sample_returns.std() + 1e-8)
        total_ret = (1 + sample_returns).prod() - 1

        bootstrap_sharpes.append(sharpe)
        bootstrap_returns.append(total_ret)

    # Calculate confidence intervals
    alpha = (1 - confidence_level) / 2

    sharpe_ci = (
        np.percentile(bootstrap_sharpes, alpha * 100),
        np.percentile(bootstrap_sharpes, (1 - alpha) * 100)
    )

    return_ci = (
        np.percentile(bootstrap_returns, alpha * 100),
        np.percentile(bootstrap_returns, (1 - alpha) * 100)
    )

    # Is significantly positive?
    sharpe_significant = sharpe_ci[0] > 0

    return {
        'sharpe_ci': sharpe_ci,
        'return_ci': return_ci,
        'sharpe_significant': sharpe_significant,
        'original_sharpe': np.sqrt(252) * returns.mean() / (returns.std() + 1e-8)
    }

# Test
strategy_returns = bt_results['capital'].pct_change().dropna().values
ci_results = bootstrap_confidence_intervals(strategy_returns)
print(f"Sharpe 95% CI: ({ci_results['sharpe_ci'][0]:.2f}, {ci_results['sharpe_ci'][1]:.2f})")
print(f"Statistically significant: {ci_results['sharpe_significant']}")

Exercise 12.6: Build Regime-Aware Backtester (Open-ended)

Create a backtester that tracks performance across different market regimes.

Solution 12.6
class RegimeAwareBacktester:
    def __init__(self, lookback=63):
        self.lookback = lookback

    def identify_regimes(self, prices):
        """Identify market regimes based on price trend."""
        returns = pd.Series(prices).pct_change()
        rolling_return = returns.rolling(self.lookback).mean() * 252
        rolling_vol = returns.rolling(self.lookback).std() * np.sqrt(252)

        regimes = pd.Series(index=range(len(prices)), dtype=str)

        for i in range(len(prices)):
            if pd.isna(rolling_return.iloc[i]):
                regimes.iloc[i] = 'unknown'
            elif rolling_return.iloc[i] > 0.1:  # >10% annualized
                regimes.iloc[i] = 'bull'
            elif rolling_return.iloc[i] < -0.1:  # <-10% annualized
                regimes.iloc[i] = 'bear'
            else:
                regimes.iloc[i] = 'sideways'

        return regimes

    def analyze_by_regime(self, strategy_returns, prices):
        """Analyze strategy performance by regime.

        strategy_returns[i] is the return realized from prices[i] to
        prices[i+1], so there is one fewer return than there are prices
        and each return is paired with the regime label at the end of
        its period.
        """
        regimes = self.identify_regimes(prices)
        # Align labels with returns: drop the first label so lengths match
        regime_labels = regimes.iloc[1:].values
        strategy_returns = np.asarray(strategy_returns)

        results = {}
        for regime in ['bull', 'bear', 'sideways']:
            mask = regime_labels == regime
            regime_returns = strategy_returns[mask]

            if len(regime_returns) > 0:
                results[regime] = {
                    'n_days': len(regime_returns),
                    'total_return': (1 + regime_returns).prod() - 1,
                    'sharpe': np.sqrt(252) * regime_returns.mean() / (regime_returns.std() + 1e-8),
                    'win_rate': (regime_returns > 0).mean()
                }

        return pd.DataFrame(results).T

# Usage
regime_analyzer = RegimeAwareBacktester(lookback=63)
strategy_rets = bt_results['capital'].pct_change().dropna().values
prices = bt_results['close'].values
regime_results = regime_analyzer.analyze_by_regime(strategy_rets, prices)
print("Performance by Regime:")
print(regime_results)

Summary

In this module, you learned:

  1. Walk-Forward Optimization: Proper methodology for testing ML strategies on unseen data

  2. Avoiding Pitfalls: Look-ahead bias, survivorship bias, and overfitting detection

  3. Realistic Backtesting: Accounting for transaction costs, slippage, and execution

  4. Robustness Testing: Monte Carlo simulations and parameter sensitivity analysis

  5. Complete Systems: Building production-ready backtesting frameworks

Key Takeaways

  • Walk-forward validation is essential for ML strategies to avoid look-ahead bias
  • Transaction costs can significantly impact strategy performance
  • Monte Carlo tests help distinguish skill from luck
  • Overfitting is the #1 enemy of ML trading strategies
  • Always test robustness before deploying any strategy
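The second takeaway is worth quantifying. With made-up but plausible numbers (not taken from the course backtest), a sketch of how per-trade costs compound over a trading year:

```python
# Illustrative numbers: a modest daily edge versus round-trip trading costs
gross_daily = 0.0008        # hypothetical gross edge per day
cost_per_trade = 0.0015     # 10 bps commission + 5 bps slippage, round trip
trades_per_day = 0.5        # trade roughly every other day

net_daily = gross_daily - cost_per_trade * trades_per_day
gross_annual = (1 + gross_daily) ** 252 - 1
net_annual = (1 + net_daily) ** 252 - 1
print(f"Gross annual: {gross_annual:.1%}, net annual: {net_annual:.1%}")
# Costs consume most of the edge even though each trade costs only 15 bps
```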

Next Steps

In Module 13, you'll learn about deploying ML models to production, including feature pipelines, model monitoring, and system architecture.

Module 13: Production ML Systems

Overview

Moving ML models from research to production requires careful engineering. This module covers the infrastructure, pipelines, and monitoring needed to deploy ML trading systems reliably.

Learning Objectives

By the end of this module, you will be able to:

  • Design feature pipelines for real-time prediction
  • Implement model versioning and deployment strategies
  • Build monitoring systems to detect model degradation
  • Create robust error handling and fallback mechanisms

Prerequisites

  • Module 11: Deep Learning for Finance
  • Module 12: Backtesting ML Strategies
  • Basic understanding of software engineering principles

Estimated Time: 3.5 hours


Section 1: Feature Pipeline Architecture

A robust feature pipeline ensures consistent feature computation between training and inference.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from typing import Dict, List, Any, Optional
from dataclasses import dataclass, field
from abc import ABC, abstractmethod
import json
import hashlib
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
print("Production ML libraries loaded")
# Feature Definition Framework
@dataclass
class FeatureDefinition:
    """Defines a single feature for the pipeline."""
    name: str
    feature_type: str  # 'price', 'volume', 'technical', 'derived'
    lookback_periods: int
    dependencies: List[str] = field(default_factory=list)
    params: Dict[str, Any] = field(default_factory=dict)
    
    def to_dict(self) -> Dict:
        return {
            'name': self.name,
            'feature_type': self.feature_type,
            'lookback_periods': self.lookback_periods,
            'dependencies': self.dependencies,
            'params': self.params
        }


class FeatureRegistry:
    """Central registry for all feature definitions."""
    
    def __init__(self):
        self.features: Dict[str, FeatureDefinition] = {}
        self.computation_order: List[str] = []
        
    def register(self, feature: FeatureDefinition):
        """Register a feature definition."""
        self.features[feature.name] = feature
        self._update_computation_order()
        
    def _update_computation_order(self):
        """Topologically sort features based on dependencies."""
        visited = set()
        order = []
        
        def visit(name):
            if name in visited:
                return
            visited.add(name)
            if name in self.features:
                for dep in self.features[name].dependencies:
                    visit(dep)
                order.append(name)
        
        for name in self.features:
            visit(name)
        
        self.computation_order = order
    
    def get_max_lookback(self) -> int:
        """Get maximum lookback period needed."""
        return max((f.lookback_periods for f in self.features.values()), default=0)
    
    def get_feature_hash(self) -> str:
        """Generate hash of feature definitions for versioning."""
        feature_str = json.dumps(
            {name: f.to_dict() for name, f in sorted(self.features.items())},
            sort_keys=True
        )
        return hashlib.md5(feature_str.encode()).hexdigest()[:8]

print("Feature definition framework created")
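The registry's `_update_computation_order` is a depth-first topological sort. The same idea in miniature, with hypothetical feature names (like the class version, this sketch assumes the dependency graph has no cycles):

```python
# Dependencies: each feature lists the features it must be computed after
features = {
    'sma_20': [],
    'price_to_sma_20': ['sma_20'],
    'signal': ['price_to_sma_20', 'rsi'],
    'rsi': [],
}

order, visited = [], set()

def visit(name):
    if name in visited:
        return
    visited.add(name)
    for dep in features.get(name, []):
        visit(dep)          # visit dependencies first
    order.append(name)      # then append the feature itself

for name in features:
    visit(name)

print(order)  # every feature appears after all of its dependencies
```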
# Feature Computation Engine
class FeatureComputer(ABC):
    """Abstract base class for feature computation."""
    
    @abstractmethod
    def compute(self, df: pd.DataFrame, params: Dict) -> pd.Series:
        pass


class ReturnFeature(FeatureComputer):
    """Compute return features."""
    
    def compute(self, df: pd.DataFrame, params: Dict) -> pd.Series:
        period = params.get('period', 1)
        return df['close'].pct_change(period)


class VolatilityFeature(FeatureComputer):
    """Compute volatility features."""
    
    def compute(self, df: pd.DataFrame, params: Dict) -> pd.Series:
        period = params.get('period', 20)
        returns = df['close'].pct_change()
        return returns.rolling(period).std()


class SMAFeature(FeatureComputer):
    """Compute simple moving average features."""
    
    def compute(self, df: pd.DataFrame, params: Dict) -> pd.Series:
        period = params.get('period', 20)
        return df['close'].rolling(period).mean()


class RSIFeature(FeatureComputer):
    """Compute RSI feature."""
    
    def compute(self, df: pd.DataFrame, params: Dict) -> pd.Series:
        period = params.get('period', 14)
        delta = df['close'].diff()
        gain = (delta.where(delta > 0, 0)).rolling(period).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(period).mean()
        rs = gain / (loss + 1e-10)
        return 100 - (100 / (1 + rs))


class MACDFeature(FeatureComputer):
    """Compute MACD feature."""
    
    def compute(self, df: pd.DataFrame, params: Dict) -> pd.Series:
        fast = params.get('fast', 12)
        slow = params.get('slow', 26)
        exp_fast = df['close'].ewm(span=fast).mean()
        exp_slow = df['close'].ewm(span=slow).mean()
        return exp_fast - exp_slow


class PriceToSMAFeature(FeatureComputer):
    """Compute price relative to SMA."""
    
    def compute(self, df: pd.DataFrame, params: Dict) -> pd.Series:
        period = params.get('period', 20)
        sma = df['close'].rolling(period).mean()
        return df['close'] / sma


# Feature Computer Registry
FEATURE_COMPUTERS = {
    'return': ReturnFeature(),
    'volatility': VolatilityFeature(),
    'sma': SMAFeature(),
    'rsi': RSIFeature(),
    'macd': MACDFeature(),
    'price_to_sma': PriceToSMAFeature()
}

print("Feature computers registered")
# Production Feature Pipeline
class FeaturePipeline:
    """Production-ready feature pipeline."""
    
    def __init__(self, registry: FeatureRegistry):
        self.registry = registry
        self.scaler = StandardScaler()
        self.is_fitted = False
        self.feature_stats = {}
        
    def compute_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Compute all registered features."""
        result = df.copy()
        
        for feature_name in self.registry.computation_order:
            feature_def = self.registry.features[feature_name]
            computer = FEATURE_COMPUTERS.get(feature_def.feature_type)
            
            if computer:
                result[feature_name] = computer.compute(result, feature_def.params)
        
        return result
    
    def fit(self, df: pd.DataFrame):
        """Fit the pipeline on training data."""
        features_df = self.compute_features(df)
        feature_cols = list(self.registry.features.keys())
        
        # Store feature statistics
        for col in feature_cols:
            self.feature_stats[col] = {
                'mean': features_df[col].mean(),
                'std': features_df[col].std(),
                'min': features_df[col].min(),
                'max': features_df[col].max()
            }
        
        # Fit scaler
        valid_data = features_df[feature_cols].dropna()
        self.scaler.fit(valid_data)
        self.is_fitted = True
        
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Transform data using fitted pipeline."""
        if not self.is_fitted:
            raise ValueError("Pipeline not fitted. Call fit() first.")
        
        features_df = self.compute_features(df)
        feature_cols = list(self.registry.features.keys())
        
        # Scale features
        valid_mask = ~features_df[feature_cols].isna().any(axis=1)
        result = features_df.copy()
        result.loc[valid_mask, feature_cols] = self.scaler.transform(
            features_df.loc[valid_mask, feature_cols]
        )
        
        return result
    
    def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Fit and transform in one step."""
        self.fit(df)
        return self.transform(df)
    
    def get_feature_vector(self, df: pd.DataFrame) -> np.ndarray:
        """Get feature vector for prediction."""
        transformed = self.transform(df)
        feature_cols = list(self.registry.features.keys())
        return transformed[feature_cols].iloc[-1].values
    
    def validate_features(self, df: pd.DataFrame) -> Dict[str, Any]:
        """Validate features against training statistics."""
        features_df = self.compute_features(df)
        validation_results = {}
        
        for col, stats in self.feature_stats.items():
            current_value = features_df[col].iloc[-1]
            
            # Check for outliers
            z_score = (current_value - stats['mean']) / (stats['std'] + 1e-10)
            is_outlier = abs(z_score) > 3
            
            validation_results[col] = {
                'value': current_value,
                'z_score': z_score,
                'is_outlier': is_outlier
            }
        
        return validation_results
    
    def save(self, filepath: str):
        """Save pipeline to file."""
        state = {
            'registry': self.registry,
            'scaler': self.scaler,
            'feature_stats': self.feature_stats,
            'is_fitted': self.is_fitted,
            'feature_hash': self.registry.get_feature_hash()
        }
        with open(filepath, 'wb') as f:
            pickle.dump(state, f)
    
    @classmethod
    def load(cls, filepath: str) -> 'FeaturePipeline':
        """Load pipeline from file."""
        with open(filepath, 'rb') as f:
            state = pickle.load(f)
        
        pipeline = cls(state['registry'])
        pipeline.scaler = state['scaler']
        pipeline.feature_stats = state['feature_stats']
        pipeline.is_fitted = state['is_fitted']
        
        return pipeline

print("FeaturePipeline class defined")
# Create and test feature pipeline
# Generate sample data
def generate_sample_data(n_samples=1000):
    np.random.seed(42)
    dates = pd.date_range(start='2020-01-01', periods=n_samples, freq='D')
    returns = np.random.normal(0.0003, 0.015, n_samples)
    prices = 100 * np.exp(np.cumsum(returns))
    
    return pd.DataFrame({
        'date': dates,
        'open': np.concatenate(([prices[0]], prices[:-1])),  # previous close, no wraparound
        'high': prices * (1 + np.abs(np.random.normal(0, 0.01, n_samples))),
        'low': prices * (1 - np.abs(np.random.normal(0, 0.01, n_samples))),
        'close': prices,
        'volume': np.random.lognormal(15, 0.5, n_samples)
    }).set_index('date')

df = generate_sample_data()

# Create feature registry
registry = FeatureRegistry()

# Register features
registry.register(FeatureDefinition('return_1d', 'return', 1, params={'period': 1}))
registry.register(FeatureDefinition('return_5d', 'return', 5, params={'period': 5}))
registry.register(FeatureDefinition('volatility_20d', 'volatility', 20, params={'period': 20}))
registry.register(FeatureDefinition('rsi', 'rsi', 14, params={'period': 14}))
registry.register(FeatureDefinition('macd', 'macd', 26, params={'fast': 12, 'slow': 26}))
registry.register(FeatureDefinition('price_to_sma_20', 'price_to_sma', 20, params={'period': 20}))

# Create pipeline
pipeline = FeaturePipeline(registry)

# Fit on training data
train_df = df.iloc[:800]
test_df = df.iloc[800:]

pipeline.fit(train_df)

print(f"Feature hash: {registry.get_feature_hash()}")
print(f"Max lookback: {registry.get_max_lookback()} days")
print(f"Computation order: {registry.computation_order}")

Section 2: Model Versioning and Deployment

Proper model versioning ensures reproducibility and enables rollback if needed.

# Model Versioning System
@dataclass
class ModelVersion:
    """Represents a versioned model."""
    version_id: str
    model_type: str
    feature_hash: str
    created_at: datetime
    metrics: Dict[str, float]
    hyperparameters: Dict[str, Any]
    is_active: bool = False
    
    def to_dict(self) -> Dict:
        return {
            'version_id': self.version_id,
            'model_type': self.model_type,
            'feature_hash': self.feature_hash,
            'created_at': self.created_at.isoformat(),
            'metrics': self.metrics,
            'hyperparameters': self.hyperparameters,
            'is_active': self.is_active
        }


class ModelRegistry:
    """Registry for model versions."""
    
    def __init__(self):
        self.versions: Dict[str, ModelVersion] = {}
        self.models: Dict[str, Any] = {}
        self.active_version: Optional[str] = None
        
    def register_model(self, model, version: ModelVersion):
        """Register a new model version."""
        self.versions[version.version_id] = version
        self.models[version.version_id] = model
        print(f"Registered model version: {version.version_id}")
        
    def activate_version(self, version_id: str):
        """Activate a specific model version."""
        if version_id not in self.versions:
            raise ValueError(f"Version {version_id} not found")
        
        # Deactivate current
        if self.active_version:
            self.versions[self.active_version].is_active = False
        
        # Activate new
        self.versions[version_id].is_active = True
        self.active_version = version_id
        print(f"Activated version: {version_id}")
        
    def get_active_model(self):
        """Get the currently active model."""
        if not self.active_version:
            raise ValueError("No active model version")
        return self.models[self.active_version]
    
    def get_version_history(self) -> pd.DataFrame:
        """Get version history as dataframe."""
        records = [v.to_dict() for v in self.versions.values()]
        return pd.DataFrame(records)
    
    def rollback(self, version_id: str):
        """Rollback to a previous version."""
        if version_id not in self.versions:
            raise ValueError(f"Version {version_id} not found")
        
        self.activate_version(version_id)
        print(f"Rolled back to version: {version_id}")
        
    def compare_versions(self, version_ids: List[str]) -> pd.DataFrame:
        """Compare metrics across versions."""
        comparisons = []
        for vid in version_ids:
            if vid in self.versions:
                v = self.versions[vid]
                record = {'version_id': vid, **v.metrics}
                comparisons.append(record)
        return pd.DataFrame(comparisons)

print("Model versioning system defined")
# Model Deployment Manager
class ModelDeploymentManager:
    """Manages model deployment lifecycle."""
    
    def __init__(self, model_registry: ModelRegistry, 
                 feature_pipeline: FeaturePipeline):
        self.model_registry = model_registry
        self.feature_pipeline = feature_pipeline
        self.deployment_history = []
        
    def train_new_version(self, train_data: pd.DataFrame, 
                          model_class, hyperparameters: Dict,
                          target_col: str = 'target') -> str:
        """Train and register a new model version."""
        # Prepare features
        features_df = self.feature_pipeline.fit_transform(train_data)
        feature_cols = list(self.feature_pipeline.registry.features.keys())
        
        # Prepare target (next-day direction), honoring the target_col parameter
        features_df[target_col] = (features_df['close'].shift(-1) > 
                                   features_df['close']).astype(int)
        
        # Remove NaN
        valid_data = features_df.dropna()
        X = valid_data[feature_cols].values
        y = valid_data[target_col].values
        
        # Train model
        model = model_class(**hyperparameters)
        model.fit(X, y)
        
        # Calculate metrics
        train_pred = model.predict(X)
        accuracy = (train_pred == y).mean()
        
        # Create version
        version_id = f"v_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        version = ModelVersion(
            version_id=version_id,
            model_type=model_class.__name__,
            feature_hash=self.feature_pipeline.registry.get_feature_hash(),
            created_at=datetime.now(),
            metrics={'train_accuracy': accuracy},
            hyperparameters=hyperparameters
        )
        
        # Register
        self.model_registry.register_model(model, version)
        
        return version_id
    
    def validate_before_deploy(self, version_id: str, 
                                validation_data: pd.DataFrame,
                                min_accuracy: float = 0.5) -> bool:
        """Validate model before deployment."""
        model = self.model_registry.models[version_id]
        
        # Prepare validation data
        features_df = self.feature_pipeline.transform(validation_data)
        feature_cols = list(self.feature_pipeline.registry.features.keys())
        
        features_df['target'] = (features_df['close'].shift(-1) > 
                                  features_df['close']).astype(int)
        
        valid_data = features_df.dropna()
        X = valid_data[feature_cols].values
        y = valid_data['target'].values
        
        # Validate
        predictions = model.predict(X)
        accuracy = (predictions == y).mean()
        
        # Update version metrics
        self.model_registry.versions[version_id].metrics['val_accuracy'] = accuracy
        
        is_valid = accuracy >= min_accuracy
        print(f"Validation accuracy: {accuracy:.4f} - {'PASSED' if is_valid else 'FAILED'}")
        
        return is_valid
    
    def deploy(self, version_id: str, force: bool = False):
        """Deploy a model version."""
        if not force:
            val_acc = self.model_registry.versions[version_id].metrics.get('val_accuracy')
            if val_acc is None:
                raise ValueError("Model not validated. Run validate_before_deploy() first.")
        
        self.model_registry.activate_version(version_id)
        
        self.deployment_history.append({
            'version_id': version_id,
            'deployed_at': datetime.now(),
            'action': 'deploy'
        })
        
        print(f"Deployed version: {version_id}")
        
    def predict(self, data: pd.DataFrame) -> np.ndarray:
        """Make predictions using active model."""
        model = self.model_registry.get_active_model()
        features_df = self.feature_pipeline.transform(data)
        feature_cols = list(self.feature_pipeline.registry.features.keys())
        
        X = features_df[feature_cols].dropna().values
        return model.predict(X)

print("ModelDeploymentManager defined")
# Test deployment workflow
model_registry = ModelRegistry()
deployment_manager = ModelDeploymentManager(model_registry, pipeline)

# Train first version
v1_id = deployment_manager.train_new_version(
    train_data=train_df,
    model_class=RandomForestClassifier,
    hyperparameters={'n_estimators': 50, 'max_depth': 3, 'random_state': 42}
)

# Validate
is_valid = deployment_manager.validate_before_deploy(v1_id, test_df)

# Deploy if valid
if is_valid:
    deployment_manager.deploy(v1_id)

# Show version history
print("\nVersion History:")
print(model_registry.get_version_history())

Section 3: Model Monitoring

Continuous monitoring detects model degradation and data drift.
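One widely used drift statistic, complementary to the per-observation z-score check implemented below, is the population stability index (PSI), which compares a feature's live distribution against its training distribution bin by bin. A minimal sketch (the `psi` helper and thresholds are illustrative; a common rule of thumb treats PSI above roughly 0.25 as a major shift):

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population stability index between two samples (illustrative helper)."""
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch values outside the training range
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)    # avoid log(0) in empty bins
    a_frac = np.clip(a_frac, 1e-6, None)
    return np.sum((a_frac - e_frac) * np.log(a_frac / e_frac))

rng = np.random.default_rng(0)
train_feature = rng.normal(0, 1, 5000)    # distribution seen at training time
live_same = rng.normal(0, 1, 1000)        # live data, no drift
live_shifted = rng.normal(0.8, 1, 1000)   # live data, mean has drifted

print(f"No drift: PSI = {psi(train_feature, live_same):.3f}")
print(f"Drifted:  PSI = {psi(train_feature, live_shifted):.3f}")
```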

# Model Monitoring System
class ModelMonitor:
    """Monitor model performance and data drift."""
    
    def __init__(self, feature_pipeline: FeaturePipeline,
                 alert_threshold: float = 0.1):
        self.feature_pipeline = feature_pipeline
        self.alert_threshold = alert_threshold
        self.prediction_log = []
        self.performance_history = []
        self.drift_alerts = []
        
    def log_prediction(self, timestamp: datetime, features: Dict,
                       prediction: int, probability: float,
                       actual: Optional[int] = None):
        """Log a prediction for monitoring."""
        self.prediction_log.append({
            'timestamp': timestamp,
            'features': features,
            'prediction': prediction,
            'probability': probability,
            'actual': actual
        })
        
    def update_actual(self, timestamp: datetime, actual: int):
        """Update actual outcome for a prediction."""
        for log in self.prediction_log:
            if log['timestamp'] == timestamp:
                log['actual'] = actual
                break
                
    def calculate_rolling_accuracy(self, window: int = 20) -> Optional[float]:
        """Calculate rolling accuracy."""
        recent_logs = [l for l in self.prediction_log[-window:] 
                       if l['actual'] is not None]
        
        if not recent_logs:
            return None
        
        correct = sum(1 for l in recent_logs if l['prediction'] == l['actual'])
        return correct / len(recent_logs)
    
    def detect_feature_drift(self, current_data: pd.DataFrame) -> Dict:
        """Detect drift in feature distributions."""
        drift_results = {}
        
        validation = self.feature_pipeline.validate_features(current_data)
        
        for feature_name, stats in validation.items():
            drift_results[feature_name] = {
                'z_score': stats['z_score'],
                'is_drifted': stats['is_outlier']
            }
            
            if stats['is_outlier']:
                self.drift_alerts.append({
                    'timestamp': datetime.now(),
                    'feature': feature_name,
                    'z_score': stats['z_score']
                })
        
        return drift_results
    
    def detect_prediction_drift(self, window: int = 100) -> Dict:
        """Detect drift in prediction distribution."""
        recent_logs = self.prediction_log[-window:]
        
        if len(recent_logs) < window // 2:
            return {'status': 'insufficient_data'}
        
        # Compare the prediction rate across the two halves of the window
        predictions = [l['prediction'] for l in recent_logs]
        
        # Split into first and second half
        mid = len(predictions) // 2
        first_half_mean = np.mean(predictions[:mid])
        second_half_mean = np.mean(predictions[mid:])
        
        drift_score = abs(second_half_mean - first_half_mean)
        
        return {
            'status': 'ok' if drift_score < self.alert_threshold else 'drift_detected',
            'drift_score': drift_score,
            'first_half_mean': first_half_mean,
            'second_half_mean': second_half_mean
        }
    
    def check_performance_degradation(self, baseline_accuracy: float,
                                       window: int = 50) -> Dict:
        """Check for performance degradation."""
        current_accuracy = self.calculate_rolling_accuracy(window)
        
        if current_accuracy is None:
            return {'status': 'insufficient_data'}
        
        degradation = baseline_accuracy - current_accuracy
        
        return {
            'status': 'ok' if degradation < self.alert_threshold else 'degraded',
            'baseline_accuracy': baseline_accuracy,
            'current_accuracy': current_accuracy,
            'degradation': degradation
        }
    
    def generate_monitoring_report(self) -> Dict:
        """Generate comprehensive monitoring report."""
        return {
            'timestamp': datetime.now(),
            'total_predictions': len(self.prediction_log),
            'predictions_with_actual': sum(1 for l in self.prediction_log 
                                           if l['actual'] is not None),
            'rolling_accuracy_20': self.calculate_rolling_accuracy(20),
            'rolling_accuracy_50': self.calculate_rolling_accuracy(50),
            'drift_alerts_count': len(self.drift_alerts),
            'recent_drift_alerts': self.drift_alerts[-5:]
        }

print("ModelMonitor class defined")
# Alert System
class AlertManager:
    """Manage alerts for model monitoring."""
    
    def __init__(self):
        self.alerts = []
        self.alert_handlers = []
        
    def register_handler(self, handler_func):
        """Register an alert handler function."""
        self.alert_handlers.append(handler_func)
        
    def raise_alert(self, alert_type: str, severity: str,
                    message: str, details: Dict = None):
        """Raise an alert."""
        alert = {
            'timestamp': datetime.now(),
            'type': alert_type,
            'severity': severity,
            'message': message,
            'details': details or {}
        }
        
        self.alerts.append(alert)
        
        # Trigger handlers
        for handler in self.alert_handlers:
            handler(alert)
            
        print(f"[{severity.upper()}] {alert_type}: {message}")
        
    def get_alerts(self, severity: Optional[str] = None,
                   since: Optional[datetime] = None) -> List[Dict]:
        """Get alerts with optional filtering."""
        filtered = self.alerts
        
        if severity:
            filtered = [a for a in filtered if a['severity'] == severity]
        
        if since:
            filtered = [a for a in filtered if a['timestamp'] >= since]
        
        return filtered

# Example handler
def print_handler(alert):
    if alert['severity'] == 'critical':
        print(f"!!! CRITICAL ALERT: {alert['message']} !!!")

alert_manager = AlertManager()
alert_manager.register_handler(print_handler)

print("AlertManager configured")
# Simulate monitoring
monitor = ModelMonitor(pipeline)

# Simulate predictions
np.random.seed(42)
for i in range(100):
    # Simulate prediction
    prediction = np.random.choice([0, 1])
    probability = 0.5 + np.random.uniform(-0.3, 0.3)
    actual = np.random.choice([0, 1])
    
    monitor.log_prediction(
        timestamp=datetime.now() + timedelta(days=i),
        features={'return_1d': np.random.normal(0, 0.02)},
        prediction=prediction,
        probability=probability,
        actual=actual
    )

# Generate report
report = monitor.generate_monitoring_report()
print("\nMonitoring Report:")
for key, value in report.items():
    if key != 'recent_drift_alerts':
        print(f"  {key}: {value}")
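The half-window mean comparison above catches shifts in the prediction rate; for distribution-level feature drift, a two-sample Kolmogorov-Smirnov test is a common complement. A minimal standalone sketch — the `ks_drift_check` helper and the 0.05 significance level are illustrative choices, not part of the `ModelMonitor` above:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_drift_check(reference: np.ndarray, current: np.ndarray,
                   alpha: float = 0.05) -> dict:
    """Two-sample KS test: a small p-value means the distributions differ."""
    statistic, p_value = ks_2samp(reference, current)
    return {
        'ks_statistic': statistic,
        'p_value': p_value,
        'is_drifted': p_value < alpha
    }

rng = np.random.default_rng(0)
reference = rng.normal(0, 0.02, size=500)    # training-period daily returns
shifted = rng.normal(0.01, 0.04, size=500)   # higher mean and volatility

print(ks_drift_check(reference, reference[::2]))  # same distribution
print(ks_drift_check(reference, shifted))         # regime change
```

Unlike the z-score check, the KS test is sensitive to changes in shape (skew, tails), not just the mean.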

Section 4: Error Handling and Fallbacks

Production systems need robust error handling and graceful degradation.

# Production Prediction Service
class PredictionService:
    """Production-ready prediction service with fallbacks."""
    
    def __init__(self, deployment_manager: ModelDeploymentManager,
                 monitor: ModelMonitor,
                 alert_manager: AlertManager):
        self.deployment_manager = deployment_manager
        self.monitor = monitor
        self.alert_manager = alert_manager
        self.fallback_prediction = 0  # Conservative: no position
        self.request_count = 0
        self.error_count = 0
        
    def predict(self, data: pd.DataFrame) -> Dict:
        """Make prediction with error handling."""
        self.request_count += 1
        result = {
            'timestamp': datetime.now(),
            'status': 'success',
            'prediction': None,
            'probability': None,
            'is_fallback': False,
            'warnings': []
        }
        
        try:
            # Validate input data
            if len(data) < self.deployment_manager.feature_pipeline.registry.get_max_lookback():
                result['warnings'].append('Insufficient data for full lookback')
            
            # Check for data quality
            if data['close'].isna().any():
                raise ValueError("Missing price data")
            
            # Detect feature drift
            drift_results = self.monitor.detect_feature_drift(data)
            drifted_features = [f for f, d in drift_results.items() if d['is_drifted']]
            
            if drifted_features:
                result['warnings'].append(f"Feature drift detected: {drifted_features}")
                self.alert_manager.raise_alert(
                    'feature_drift', 'warning',
                    f"Drift detected in features: {drifted_features}"
                )
            
            # Make prediction
            model = self.deployment_manager.model_registry.get_active_model()
            features_df = self.deployment_manager.feature_pipeline.transform(data)
            feature_cols = list(self.deployment_manager.feature_pipeline.registry.features.keys())
            
            X = features_df[feature_cols].iloc[-1:].values
            
            if np.isnan(X).any():
                raise ValueError("NaN values in features")
            
            prediction = model.predict(X)[0]
            probability = model.predict_proba(X)[0, 1]
            
            result['prediction'] = int(prediction)
            result['probability'] = float(probability)
            
            # Log prediction
            self.monitor.log_prediction(
                timestamp=result['timestamp'],
                features=dict(zip(feature_cols, X[0])),
                prediction=prediction,
                probability=probability
            )
            
        except Exception as e:
            self.error_count += 1
            result['status'] = 'fallback'
            result['prediction'] = self.fallback_prediction
            result['probability'] = 0.5
            result['is_fallback'] = True
            result['error'] = str(e)
            
            # Alert on errors
            error_rate = self.error_count / self.request_count
            if error_rate > 0.1:
                self.alert_manager.raise_alert(
                    'high_error_rate', 'critical',
                    f"Error rate: {error_rate:.2%}",
                    {'error_count': self.error_count, 'request_count': self.request_count}
                )
        
        return result
    
    def health_check(self) -> Dict:
        """Check service health."""
        return {
            'status': 'healthy' if self.error_count / max(1, self.request_count) < 0.1 else 'degraded',
            'request_count': self.request_count,
            'error_count': self.error_count,
            'error_rate': self.error_count / max(1, self.request_count),
            'active_model': self.deployment_manager.model_registry.active_version
        }

print("PredictionService class defined")
# Test prediction service
service = PredictionService(deployment_manager, monitor, alert_manager)

# Normal prediction
result = service.predict(test_df)
print("\nPrediction Result:")
for key, value in result.items():
    print(f"  {key}: {value}")

# Health check
print("\nHealth Check:")
health = service.health_check()
for key, value in health.items():
    print(f"  {key}: {value}")
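Per-request fallbacks handle isolated failures; when errors cluster, a circuit breaker stops calling the model entirely until a cooldown passes. A minimal sketch — the class name, threshold, and cooldown are illustrative, not part of the `PredictionService` above:

```python
from datetime import datetime, timedelta

class CircuitBreaker:
    """Open the circuit after repeated failures; allow a retry after cooldown."""

    def __init__(self, failure_threshold: int = 5,
                 cooldown: timedelta = timedelta(minutes=10)):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def allow_request(self, now: datetime = None) -> bool:
        now = now or datetime.now()
        if self.opened_at is None:
            return True
        # Half-open: permit a trial request once the cooldown has elapsed
        return now - self.opened_at >= self.cooldown

    def record_success(self):
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self, now: datetime = None):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = now or datetime.now()

breaker = CircuitBreaker(failure_threshold=3)
for _ in range(3):
    breaker.record_failure()
print(breaker.allow_request())  # circuit open: False
```

In a service like the one above, `allow_request()` would gate the model call, with the fallback prediction returned while the circuit is open.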

Section 5: Module Project - Complete Production System

Build a complete production ML trading system.

# Complete Production ML Trading System
class ProductionTradingSystem:
    """Complete production ML trading system."""
    
    def __init__(self, initial_capital: float = 100000):
        self.initial_capital = initial_capital
        self.capital = initial_capital
        self.position = 0
        
        # Initialize components
        self.feature_registry = FeatureRegistry()
        self._setup_features()
        
        self.feature_pipeline = FeaturePipeline(self.feature_registry)
        self.model_registry = ModelRegistry()
        self.deployment_manager = ModelDeploymentManager(
            self.model_registry, self.feature_pipeline
        )
        self.monitor = ModelMonitor(self.feature_pipeline)
        self.alert_manager = AlertManager()
        self.prediction_service = PredictionService(
            self.deployment_manager, self.monitor, self.alert_manager
        )
        
        # Trading state
        self.trade_history = []
        self.equity_curve = []
        
    def _setup_features(self):
        """Setup standard feature set."""
        features = [
            FeatureDefinition('return_1d', 'return', 1, params={'period': 1}),
            FeatureDefinition('return_5d', 'return', 5, params={'period': 5}),
            FeatureDefinition('return_20d', 'return', 20, params={'period': 20}),
            FeatureDefinition('volatility_20d', 'volatility', 20, params={'period': 20}),
            FeatureDefinition('rsi', 'rsi', 14, params={'period': 14}),
            FeatureDefinition('macd', 'macd', 26, params={'fast': 12, 'slow': 26}),
            FeatureDefinition('price_to_sma_20', 'price_to_sma', 20, params={'period': 20}),
            FeatureDefinition('price_to_sma_50', 'price_to_sma', 50, params={'period': 50}),
        ]
        
        for feature in features:
            self.feature_registry.register(feature)
    
    def train(self, train_data: pd.DataFrame, 
              model_class=RandomForestClassifier,
              hyperparameters: Dict = None):
        """Train and deploy a model."""
        if hyperparameters is None:
            hyperparameters = {
                'n_estimators': 100,
                'max_depth': 5,
                'random_state': 42
            }
        
        # Train new version
        version_id = self.deployment_manager.train_new_version(
            train_data, model_class, hyperparameters
        )
        
        return version_id
    
    def validate_and_deploy(self, version_id: str, 
                            validation_data: pd.DataFrame,
                            min_accuracy: float = 0.5):
        """Validate and deploy a model version."""
        is_valid = self.deployment_manager.validate_before_deploy(
            version_id, validation_data, min_accuracy
        )
        
        if is_valid:
            self.deployment_manager.deploy(version_id)
            return True
        else:
            self.alert_manager.raise_alert(
                'validation_failed', 'warning',
                f"Model {version_id} failed validation"
            )
            return False
    
    def process_bar(self, current_data: pd.DataFrame, 
                    current_price: float) -> Dict:
        """Process a new bar and potentially trade."""
        # Get prediction
        prediction_result = self.prediction_service.predict(current_data)
        
        # Determine position (fallback predictions keep the system flat,
        # matching the conservative fallback in PredictionService)
        if prediction_result['is_fallback']:
            signal = 0
        else:
            signal = 1 if prediction_result['prediction'] == 1 else -1
        
        trade_result = None
        
        # Check for position change
        if signal != self.position:
            trade_result = self._execute_trade(signal, current_price)
        
        # Update equity
        self.equity_curve.append({
            'timestamp': datetime.now(),
            'capital': self.capital,
            'position': self.position
        })
        
        return {
            'prediction': prediction_result,
            'signal': signal,
            'trade': trade_result,
            'capital': self.capital,
            'position': self.position
        }
    
    def _execute_trade(self, new_position: int, price: float) -> Dict:
        """Execute a trade."""
        # Calculate trade cost (0.1% commission + 0.05% slippage)
        cost_rate = 0.0015
        position_change = abs(new_position - self.position)
        trade_cost = self.capital * position_change * cost_rate
        
        self.capital -= trade_cost
        self.position = new_position
        
        trade = {
            'timestamp': datetime.now(),
            'price': price,
            'new_position': new_position,
            'cost': trade_cost
        }
        
        self.trade_history.append(trade)
        
        return trade
    
    def update_pnl(self, price_return: float):
        """Update P&L based on position and return."""
        pnl = self.capital * self.position * price_return
        self.capital += pnl
        return pnl
    
    def get_performance_summary(self) -> Dict:
        """Get performance summary."""
        if not self.equity_curve:
            return {'status': 'no_data'}
        
        equity_df = pd.DataFrame(self.equity_curve)
        capitals = equity_df['capital'].values
        
        returns = np.diff(capitals) / capitals[:-1]
        
        return {
            'total_return': (self.capital / self.initial_capital) - 1,
            'sharpe_ratio': np.sqrt(252) * np.mean(returns) / (np.std(returns) + 1e-8) if len(returns) > 0 else 0,
            'max_drawdown': (capitals / np.maximum.accumulate(capitals) - 1).min() if len(capitals) > 0 else 0,
            'n_trades': len(self.trade_history),
            'total_costs': sum(t['cost'] for t in self.trade_history),
            'active_model': self.model_registry.active_version,
            'health': self.prediction_service.health_check()
        }
    
    def generate_system_report(self) -> str:
        """Generate comprehensive system report."""
        perf = self.get_performance_summary()
        monitoring = self.monitor.generate_monitoring_report()
        
        report = f"""
========================================
PRODUCTION TRADING SYSTEM REPORT
========================================
Generated: {datetime.now()}

--- Performance ---
Total Return: {perf.get('total_return', 0):.2%}
Sharpe Ratio: {perf.get('sharpe_ratio', 0):.2f}
Max Drawdown: {perf.get('max_drawdown', 0):.2%}
Number of Trades: {perf.get('n_trades', 0)}
Total Costs: ${perf.get('total_costs', 0):,.2f}

--- Model ---
Active Model: {perf.get('active_model', 'None')}
Feature Hash: {self.feature_registry.get_feature_hash()}

--- Monitoring ---
Total Predictions: {monitoring.get('total_predictions', 0)}
Rolling Accuracy (20): {monitoring.get('rolling_accuracy_20', 'N/A')}
Drift Alerts: {monitoring.get('drift_alerts_count', 0)}

--- Health ---
Status: {perf.get('health', {}).get('status', 'Unknown')}
Error Rate: {perf.get('health', {}).get('error_rate', 0):.2%}
========================================
"""
        return report

print("ProductionTradingSystem class defined")
# Run complete production system
# Generate more data
full_data = generate_sample_data(1500)

# Split data
train_data = full_data.iloc[:1000]
val_data = full_data.iloc[1000:1200]
test_data = full_data.iloc[1200:]

# Initialize system
system = ProductionTradingSystem(initial_capital=100000)

# Train model
print("Training model...")
version_id = system.train(train_data)

# Validate and deploy
print("\nValidating and deploying...")
deployed = system.validate_and_deploy(version_id, val_data, min_accuracy=0.45)
# Simulate live trading
print("\nSimulating live trading...")

lookback = 100  # Days of history needed

for i in range(lookback, len(test_data)):
    # Get current data window
    current_data = test_data.iloc[i-lookback:i+1]
    current_price = test_data.iloc[i]['close']
    
    # Process bar
    result = system.process_bar(current_data, current_price)
    
    # Update P&L if we have previous price
    if i > lookback:
        prev_price = test_data.iloc[i-1]['close']
        price_return = (current_price - prev_price) / prev_price
        system.update_pnl(price_return)

# Generate final report
print(system.generate_system_report())
# Visualize system performance
if system.equity_curve:
    equity_df = pd.DataFrame(system.equity_curve)
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # Equity curve
    axes[0, 0].plot(range(len(equity_df)), equity_df['capital'])
    axes[0, 0].set_xlabel('Time Step')
    axes[0, 0].set_ylabel('Capital ($)')
    axes[0, 0].set_title('Equity Curve')
    
    # Position over time
    axes[0, 1].step(range(len(equity_df)), equity_df['position'], where='post')
    axes[0, 1].set_xlabel('Time Step')
    axes[0, 1].set_ylabel('Position')
    axes[0, 1].set_title('Position Over Time')
    
    # Trade costs
    if system.trade_history:
        costs = [t['cost'] for t in system.trade_history]
        axes[1, 0].bar(range(len(costs)), costs)
        axes[1, 0].set_xlabel('Trade Number')
        axes[1, 0].set_ylabel('Cost ($)')
        axes[1, 0].set_title('Trade Costs')
    
    # Drawdown
    capitals = equity_df['capital'].values
    peak = np.maximum.accumulate(capitals)
    drawdown = (capitals - peak) / peak
    axes[1, 1].fill_between(range(len(drawdown)), drawdown, 0, alpha=0.7, color='red')
    axes[1, 1].set_xlabel('Time Step')
    axes[1, 1].set_ylabel('Drawdown')
    axes[1, 1].set_title('Drawdown')
    
    plt.tight_layout()
    plt.show()

Exercises

Complete the following exercises to practice production ML systems.

Exercise 13.1: Create Feature Definition (Guided)

Define a new feature for the pipeline.

Exercise
Solution 13.1
def create_bollinger_band_feature():
    feature = FeatureDefinition(
        name='bb_width',
        feature_type='volatility',
        lookback_periods=20,
        params={'period': 20, 'std_dev': 2}
    )

    return feature

Exercise 13.2: Implement Model Version Comparison (Guided)

Create a function to compare model versions.

Exercise
Solution 13.2
def compare_model_versions(registry: ModelRegistry,
                           version_ids: List[str]) -> pd.DataFrame:
    records = []

    for version_id in version_ids:
        # Get version from registry
        if version_id in registry.versions:
            version = registry.versions[version_id]

            # Create record dict
            record = {
                'version_id': version.version_id,
                'model_type': version.model_type,
                'created_at': version.created_at,
                'is_active': version.is_active
            }

            # Add all metrics
            for metric_name, metric_value in version.metrics.items():
                record[metric_name] = metric_value

            records.append(record)

    return pd.DataFrame(records)

Exercise 13.3: Implement Drift Detection (Guided)

Create a simple drift detection function.

Exercise
Solution 13.3
def detect_distribution_drift(reference_data: np.ndarray,
                               current_data: np.ndarray,
                               threshold: float = 0.1) -> Dict:
    # Calculate reference statistics
    ref_mean = np.mean(reference_data)
    ref_std = np.std(reference_data)

    # Calculate current statistics
    curr_mean = np.mean(current_data)
    curr_std = np.std(current_data)

    # Calculate drift metrics
    mean_drift = abs(curr_mean - ref_mean) / (ref_std + 1e-10)
    std_drift = abs(curr_std - ref_std) / (ref_std + 1e-10)

    # Determine if drifted
    is_drifted = mean_drift > threshold or std_drift > threshold

    return {
        'mean_drift': mean_drift,
        'std_drift': std_drift,
        'is_drifted': is_drifted,
        'ref_mean': ref_mean,
        'curr_mean': curr_mean
    }
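A quick sanity check of this logic on synthetic data. The function body is repeated here so the snippet runs on its own; the return distributions are made up for illustration:

```python
import numpy as np

def detect_distribution_drift(reference_data, current_data, threshold=0.1):
    # Same statistics as the solution above, repeated to keep this standalone
    ref_mean, ref_std = np.mean(reference_data), np.std(reference_data)
    curr_mean, curr_std = np.mean(current_data), np.std(current_data)
    mean_drift = abs(curr_mean - ref_mean) / (ref_std + 1e-10)
    std_drift = abs(curr_std - ref_std) / (ref_std + 1e-10)
    return {'mean_drift': mean_drift, 'std_drift': std_drift,
            'is_drifted': mean_drift > threshold or std_drift > threshold}

rng = np.random.default_rng(42)
reference = rng.normal(0, 0.02, size=1000)
regime_change = rng.normal(0.005, 0.05, size=200)  # higher mean and volatility

print(detect_distribution_drift(reference, reference))      # exactly zero drift
print(detect_distribution_drift(reference, regime_change))  # flagged as drifted
```

Note that the drift score is scaled by the reference standard deviation, so a 0.1 threshold means "a shift of one tenth of a typical daily move".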

Exercise 13.4: Build Feature Store (Open-ended)

Create a simple feature store for caching computed features.

Exercise
Solution 13.4
class FeatureStore:
    def __init__(self, ttl_hours: int = 24):
        self.cache = {}  # {date: {feature_name: value}}
        self.timestamps = {}  # {date: cache_time}
        self.ttl = timedelta(hours=ttl_hours)

    def put(self, date: datetime, features: Dict[str, float]):
        """Store features for a date."""
        date_key = date.date() if isinstance(date, datetime) else date
        self.cache[date_key] = features
        self.timestamps[date_key] = datetime.now()

    def get(self, date: datetime, feature_names: List[str] = None) -> Optional[Dict]:
        """Get features for a date (point-in-time lookup)."""
        date_key = date.date() if isinstance(date, datetime) else date

        if date_key not in self.cache:
            return None

        # Check TTL
        if datetime.now() - self.timestamps[date_key] > self.ttl:
            del self.cache[date_key]
            del self.timestamps[date_key]
            return None

        features = self.cache[date_key]

        if feature_names:
            return {k: v for k, v in features.items() if k in feature_names}
        return features

    def get_range(self, start_date: datetime, end_date: datetime) -> pd.DataFrame:
        """Get features for a date range."""
        records = []
        current = start_date

        while current <= end_date:
            features = self.get(current)
            if features:
                records.append({'date': current, **features})
            current += timedelta(days=1)

        return pd.DataFrame(records)

    def save(self, filepath: str):
        """Save feature store to disk."""
        with open(filepath, 'wb') as f:
            pickle.dump({'cache': self.cache, 'timestamps': self.timestamps}, f)

    @classmethod
    def load(cls, filepath: str) -> 'FeatureStore':
        """Load feature store from disk."""
        with open(filepath, 'rb') as f:
            data = pickle.load(f)

        store = cls()
        store.cache = data['cache']
        store.timestamps = data['timestamps']
        return store

    def cleanup_expired(self):
        """Remove expired entries."""
        now = datetime.now()
        expired = [k for k, v in self.timestamps.items() if now - v > self.ttl]
        for k in expired:
            del self.cache[k]
            del self.timestamps[k]
        return len(expired)

# Usage
store = FeatureStore(ttl_hours=24)
store.put(datetime.now(), {'return_1d': 0.01, 'rsi': 55})
print(store.get(datetime.now()))

Exercise 13.5: Implement A/B Testing Framework (Open-ended)

Create a framework for A/B testing model versions.

Exercise
Solution 13.5
class ABTestManager:
    def __init__(self, model_a, model_b, split_ratio=0.5):
        self.model_a = model_a
        self.model_b = model_b
        self.split_ratio = split_ratio
        self.results_a = []
        self.results_b = []

    def predict(self, X):
        """Route prediction to A or B based on split."""
        if np.random.random() < self.split_ratio:
            return 'A', self.model_a.predict(X)
        else:
            return 'B', self.model_b.predict(X)

    def record_outcome(self, model_id: str, prediction: int, actual: int):
        """Record prediction outcome."""
        is_correct = prediction == actual
        if model_id == 'A':
            self.results_a.append(is_correct)
        else:
            self.results_b.append(is_correct)

    def get_performance(self) -> Dict:
        """Get performance comparison."""
        acc_a = np.mean(self.results_a) if self.results_a else 0
        acc_b = np.mean(self.results_b) if self.results_b else 0

        return {
            'model_a': {'accuracy': acc_a, 'n_samples': len(self.results_a)},
            'model_b': {'accuracy': acc_b, 'n_samples': len(self.results_b)}
        }

    def is_significant(self, confidence=0.95) -> Dict:
        """Check if difference is statistically significant."""
        if len(self.results_a) < 30 or len(self.results_b) < 30:
            return {'significant': False, 'reason': 'insufficient_samples'}

        # Two-proportion z-test
        p_a = np.mean(self.results_a)
        p_b = np.mean(self.results_b)
        n_a = len(self.results_a)
        n_b = len(self.results_b)

        p_pooled = (p_a * n_a + p_b * n_b) / (n_a + n_b)
        se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_a + 1/n_b))

        z = (p_a - p_b) / (se + 1e-10)

        # Two-sided critical value for the requested confidence level
        # (1.96 for the default 95%)
        from scipy.stats import norm
        z_critical = norm.ppf(1 - (1 - confidence) / 2)

        return {
            'significant': abs(z) > z_critical,
            'z_score': z,
            'winner': 'A' if z > z_critical else ('B' if z < -z_critical else 'tie')
        }

    def get_recommendation(self) -> str:
        """Get deployment recommendation."""
        perf = self.get_performance()
        sig = self.is_significant()

        if not sig['significant']:
            return "Continue testing - no significant difference yet"

        winner = sig['winner']
        return f"Deploy Model {winner} - statistically significant improvement"

# Usage
# ab_test = ABTestManager(model_v1, model_v2)
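The two-proportion z-test inside `is_significant` can be checked on its own. A standalone sketch with synthetic outcome counts — the 60% vs 50% accuracies and sample sizes are made-up numbers for illustration:

```python
import numpy as np

def two_proportion_z(p_a: float, n_a: int, p_b: float, n_b: int) -> float:
    """Z-statistic for H0: the two accuracy rates are equal."""
    p_pooled = (p_a * n_a + p_b * n_b) / (n_a + n_b)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / (se + 1e-10)

# Model A: 60% accuracy on 500 predictions; Model B: 50% on 500
z = two_proportion_z(0.60, 500, 0.50, 500)
print(f"z = {z:.2f}, significant at 95%: {abs(z) > 1.96}")  # z ≈ 3.18
```

A 10-point accuracy gap on 500 samples per arm clears the 1.96 threshold comfortably; smaller gaps need proportionally more samples, which is why the solution refuses to test with fewer than 30 outcomes per model.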

Exercise 13.6: Create Automated Retraining Pipeline (Open-ended)

Build an automated pipeline that retrains models when performance degrades.

Exercise
Solution 13.6
class AutoRetrainer:
    def __init__(self, deployment_manager: ModelDeploymentManager,
                 monitor: ModelMonitor,
                 accuracy_threshold: float = 0.48,
                 retrain_window: int = 252):
        self.deployment_manager = deployment_manager
        self.monitor = monitor
        self.accuracy_threshold = accuracy_threshold
        self.retrain_window = retrain_window
        self.retrain_history = []
        self.last_retrain = None
        self.min_retrain_interval = timedelta(days=7)

    def check_retrain_needed(self) -> bool:
        """Check if retraining is needed."""
        current_accuracy = self.monitor.calculate_rolling_accuracy(50)

        if current_accuracy is None:
            return False

        # Check if enough time since last retrain
        if self.last_retrain:
            if datetime.now() - self.last_retrain < self.min_retrain_interval:
                return False

        return current_accuracy < self.accuracy_threshold

    def retrain(self, recent_data: pd.DataFrame,
                model_class=RandomForestClassifier,
                hyperparameters: Dict = None) -> Optional[str]:
        """Retrain model on recent data."""
        if not self.check_retrain_needed():
            return None

        if hyperparameters is None:
            hyperparameters = {'n_estimators': 100, 'max_depth': 5, 'random_state': 42}

        # Use recent data for training
        train_data = recent_data.iloc[-self.retrain_window:]

        # Train new version
        version_id = self.deployment_manager.train_new_version(
            train_data, model_class, hyperparameters
        )

        # Log retraining
        self.retrain_history.append({
            'timestamp': datetime.now(),
            'version_id': version_id,
            'trigger_accuracy': self.monitor.calculate_rolling_accuracy(50)
        })

        self.last_retrain = datetime.now()

        return version_id

    def auto_deploy(self, version_id: str,
                    validation_data: pd.DataFrame,
                    min_improvement: float = 0.02) -> bool:
        """Automatically deploy if new model is better."""
        # Validate new model
        is_valid = self.deployment_manager.validate_before_deploy(
            version_id, validation_data
        )

        if not is_valid:
            return False

        # Check if improvement is significant
        current_accuracy = self.monitor.calculate_rolling_accuracy(50) or 0
        new_accuracy = self.deployment_manager.model_registry.versions[version_id].metrics.get('val_accuracy', 0)

        if new_accuracy - current_accuracy >= min_improvement:
            self.deployment_manager.deploy(version_id)
            return True

        return False

    def run_auto_retrain_cycle(self, recent_data: pd.DataFrame,
                                validation_data: pd.DataFrame) -> Dict:
        """Run complete auto-retrain cycle."""
        result = {
            'retrain_needed': self.check_retrain_needed(),
            'retrained': False,
            'deployed': False
        }

        if result['retrain_needed']:
            version_id = self.retrain(recent_data)
            if version_id:
                result['retrained'] = True
                result['version_id'] = version_id
                result['deployed'] = self.auto_deploy(version_id, validation_data)

        return result

# Usage
# auto_retrainer = AutoRetrainer(deployment_manager, monitor)
# result = auto_retrainer.run_auto_retrain_cycle(recent_data, val_data)

Summary

In this module, you learned:

  1. Feature Pipelines: Building robust, versioned feature computation systems

  2. Model Versioning: Managing model versions with proper metadata and rollback capability

  3. Model Monitoring: Detecting drift, degradation, and anomalies in production

  4. Error Handling: Building resilient systems with fallbacks and alerts

  5. Production Systems: Integrating all components into a complete trading system

Key Takeaways

  • Feature pipelines must be consistent between training and inference
  • Model versioning enables reproducibility and safe rollbacks
  • Continuous monitoring catches problems before they cause losses
  • Fallback mechanisms ensure the system degrades gracefully
  • Production systems require more engineering than research systems

Next Steps

In Module 14, you'll explore advanced ML topics including reinforcement learning, online learning, and ensemble methods for finance.

Module 14: Advanced ML Topics

Overview

This module explores cutting-edge ML techniques for finance, including reinforcement learning, online learning, and advanced ensemble methods. Each addresses challenges unique to financial markets, such as non-stationarity, regime shifts, and concept drift.

Learning Objectives

By the end of this module, you will be able to:

  • Apply reinforcement learning to portfolio optimization
  • Implement online learning for adapting to market changes
  • Build advanced ensemble methods for improved predictions
  • Understand meta-learning approaches for finance

Prerequisites

  • Module 11: Deep Learning for Finance
  • Module 12: Backtesting ML Strategies
  • Module 13: Production ML Systems

Estimated Time: 4 hours


Section 1: Reinforcement Learning for Trading

Reinforcement learning (RL) frames trading as a sequential decision problem where an agent learns to maximize cumulative rewards.
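Concretely, the agent maximizes the expected discounted return, and (in the tabular setting used in this section) updates its action values with the standard Q-learning rule:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1},
\qquad
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```

Here γ is the discount factor (`gamma` in the code below) and α the learning rate (`learning_rate`).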

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from typing import Dict, List, Tuple, Any, Optional
from dataclasses import dataclass
from collections import deque
import random
from abc import ABC, abstractmethod
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
print("Advanced ML libraries loaded")
# Trading Environment for RL
class TradingEnvironment:
    """RL environment for trading."""
    
    def __init__(self, df: pd.DataFrame, initial_capital: float = 100000,
                 commission: float = 0.001, window_size: int = 20):
        self.df = df.reset_index(drop=True)
        self.initial_capital = initial_capital
        self.commission = commission
        self.window_size = window_size
        
        # State variables
        self.current_step = None
        self.capital = None
        self.position = None  # -1, 0, 1
        self.entry_price = None
        
        # Actions: 0=hold, 1=buy, 2=sell
        self.action_space = 3
        
    def reset(self) -> np.ndarray:
        """Reset environment to initial state."""
        self.current_step = self.window_size
        self.capital = self.initial_capital
        self.position = 0
        self.entry_price = None
        
        return self._get_state()
    
    def _get_state(self) -> np.ndarray:
        """Get current state observation."""
        # Price-based features
        window = self.df.iloc[self.current_step - self.window_size:self.current_step]
        
        # Normalized returns
        returns = window['close'].pct_change().fillna(0).values
        
        # Position encoding
        position_encoding = np.array([self.position])
        
        # PnL if in position
        if self.position != 0 and self.entry_price:
            unrealized_pnl = (self.df.iloc[self.current_step]['close'] - self.entry_price) / self.entry_price
            unrealized_pnl = np.array([unrealized_pnl * self.position])
        else:
            unrealized_pnl = np.array([0.0])
        
        state = np.concatenate([returns, position_encoding, unrealized_pnl])
        return state.astype(np.float32)
    
    def step(self, action: int) -> Tuple[np.ndarray, float, bool, Dict]:
        """Take action and return next state, reward, done, info."""
        current_price = self.df.iloc[self.current_step]['close']
        reward = 0
        
        # Execute action
        if action == 1:  # Buy
            if self.position <= 0:
                # Close short if any
                if self.position == -1:
                    pnl = (self.entry_price - current_price) / self.entry_price
                    self.capital *= (1 + pnl - self.commission)
                
                # Open long
                self.position = 1
                self.entry_price = current_price
                self.capital *= (1 - self.commission)
                
        elif action == 2:  # Sell
            if self.position >= 0:
                # Close long if any
                if self.position == 1:
                    pnl = (current_price - self.entry_price) / self.entry_price
                    self.capital *= (1 + pnl - self.commission)
                
                # Open short
                self.position = -1
                self.entry_price = current_price
                self.capital *= (1 - self.commission)
        
        # Move to next step
        self.current_step += 1
        
        # Calculate reward (daily return of position)
        if self.current_step < len(self.df):
            next_price = self.df.iloc[self.current_step]['close']
            price_return = (next_price - current_price) / current_price
            reward = self.position * price_return
        
        # Check if done
        done = self.current_step >= len(self.df) - 1
        
        # Get next state
        next_state = self._get_state() if not done else np.zeros(self.state_size, dtype=np.float32)
        
        info = {
            'capital': self.capital,
            'position': self.position,
            'step': self.current_step
        }
        
        return next_state, reward, done, info
    
    @property
    def state_size(self) -> int:
        """Get state dimension."""
        return self.window_size + 2  # returns + position + unrealized pnl

print("TradingEnvironment class defined")
# Q-Learning Agent
class QLearningAgent:
    """Simple Q-learning agent for trading."""
    
    def __init__(self, state_size: int, action_size: int,
                 learning_rate: float = 0.01,
                 gamma: float = 0.95,
                 epsilon: float = 1.0,
                 epsilon_decay: float = 0.995,
                 epsilon_min: float = 0.01):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        
        # Q-table (discretized)
        self.n_bins = 10
        self.q_table = {}
        
    def _discretize_state(self, state: np.ndarray) -> tuple:
        """Discretize continuous state."""
        # Clip and bin the state
        clipped = np.clip(state, -1, 1)
        binned = np.digitize(clipped, np.linspace(-1, 1, self.n_bins))
        return tuple(binned)
    
    def get_q_values(self, state: np.ndarray) -> np.ndarray:
        """Get Q-values for a state."""
        discrete_state = self._discretize_state(state)
        if discrete_state not in self.q_table:
            self.q_table[discrete_state] = np.zeros(self.action_size)
        return self.q_table[discrete_state]
    
    def choose_action(self, state: np.ndarray, training: bool = True) -> int:
        """Choose action using epsilon-greedy policy."""
        if training and np.random.random() < self.epsilon:
            return np.random.randint(self.action_size)
        
        q_values = self.get_q_values(state)
        return np.argmax(q_values)
    
    def learn(self, state: np.ndarray, action: int, 
              reward: float, next_state: np.ndarray, done: bool):
        """Update Q-values."""
        discrete_state = self._discretize_state(state)
        discrete_next_state = self._discretize_state(next_state)
        
        # Initialize if needed
        if discrete_state not in self.q_table:
            self.q_table[discrete_state] = np.zeros(self.action_size)
        if discrete_next_state not in self.q_table:
            self.q_table[discrete_next_state] = np.zeros(self.action_size)
        
        # Q-learning update
        current_q = self.q_table[discrete_state][action]
        
        if done:
            target_q = reward
        else:
            target_q = reward + self.gamma * np.max(self.q_table[discrete_next_state])
        
        self.q_table[discrete_state][action] += self.learning_rate * (target_q - current_q)
    
    def decay_epsilon(self):
        """Decay exploration rate."""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

print("QLearningAgent class defined")
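As a sanity check, one `learn` step can be traced by hand. The values below are illustrative, with a larger learning rate than the agent's default so the change is visible:

```python
import numpy as np

# Trace one tabular Q-learning update by hand (illustrative values;
# alpha=0.1 rather than the agent's default 0.01 so the change is visible)
alpha, gamma = 0.1, 0.95
q_s = np.array([0.0, 0.2, -0.1])       # Q(s, ·) for actions hold/buy/sell
q_next = np.array([0.05, 0.3, 0.0])    # Q(s', ·)
action, reward = 1, 0.01               # took "buy", earned a small return

target = reward + gamma * q_next.max()          # 0.01 + 0.95 * 0.3 = 0.295
q_s[action] += alpha * (target - q_s[action])   # 0.2 + 0.1 * 0.095 = 0.2095
print(q_s[action])
```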
# Generate data and train RL agent
def generate_trading_data(n_samples=1000):
    np.random.seed(42)
    returns = np.random.normal(0.0003, 0.015, n_samples)
    prices = 100 * np.exp(np.cumsum(returns))
    
    return pd.DataFrame({
        'close': prices,
        'volume': np.random.lognormal(15, 0.5, n_samples)
    })

df = generate_trading_data(1000)

# Create environment and agent
env = TradingEnvironment(df, window_size=10)
agent = QLearningAgent(
    state_size=env.state_size,
    action_size=env.action_space
)

# Training loop
n_episodes = 100
episode_rewards = []

print("Training RL agent...")
for episode in range(n_episodes):
    state = env.reset()
    total_reward = 0
    done = False
    
    while not done:
        action = agent.choose_action(state, training=True)
        next_state, reward, done, info = env.step(action)
        
        agent.learn(state, action, reward, next_state, done)
        
        state = next_state
        total_reward += reward
    
    agent.decay_epsilon()
    episode_rewards.append(total_reward)
    
    if (episode + 1) % 20 == 0:
        avg_reward = np.mean(episode_rewards[-20:])
        print(f"Episode {episode + 1}: Avg Reward = {avg_reward:.4f}, Epsilon = {agent.epsilon:.4f}")
# Evaluate trained agent
state = env.reset()
done = False
capitals = [env.initial_capital]
positions = [0]

while not done:
    action = agent.choose_action(state, training=False)
    next_state, reward, done, info = env.step(action)
    
    capitals.append(info['capital'])
    positions.append(info['position'])
    state = next_state

# Visualize results
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Training rewards
axes[0, 0].plot(episode_rewards)
axes[0, 0].set_xlabel('Episode')
axes[0, 0].set_ylabel('Total Reward')
axes[0, 0].set_title('Training Progress')

# Equity curve
axes[0, 1].plot(capitals)
axes[0, 1].set_xlabel('Step')
axes[0, 1].set_ylabel('Capital ($)')
axes[0, 1].set_title('RL Agent Equity Curve')

# Position over time
axes[1, 0].step(range(len(positions)), positions, where='post')
axes[1, 0].set_xlabel('Step')
axes[1, 0].set_ylabel('Position')
axes[1, 0].set_title('Positions Over Time')

# Compare with buy and hold (aligned to the agent's start at step window_size)
start = env.window_size
buy_hold = env.initial_capital * (df['close'].iloc[start:] / df['close'].iloc[start])
axes[1, 1].plot(range(len(capitals)), capitals, label='RL Agent')
axes[1, 1].plot(range(len(buy_hold)), buy_hold.values, label='Buy & Hold', alpha=0.7)
axes[1, 1].set_xlabel('Step')
axes[1, 1].set_ylabel('Capital ($)')
axes[1, 1].set_title('RL vs Buy & Hold')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

print(f"\nFinal Capital: ${capitals[-1]:,.2f}")
print(f"Total Return: {(capitals[-1] / capitals[0] - 1):.2%}")

Section 2: Online Learning

Online learning allows models to adapt continuously to new data without full retraining.
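scikit-learn supports this pattern out of the box: any estimator with a `partial_fit` method can be trained one sample (or mini-batch) at a time. A minimal prequential (predict-then-update) loop on synthetic data, as a sketch of the idea:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stream: label depends linearly on the first two features
rng = np.random.default_rng(0)
X_stream = rng.normal(size=(500, 5))
y_stream = (X_stream[:, 0] + X_stream[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)
correct = 0
for xi, yi in zip(X_stream, y_stream):
    xi = xi.reshape(1, -1)
    if hasattr(clf, "coef_"):          # predict before updating (prequential evaluation)
        correct += int(clf.predict(xi)[0] == yi)
    # classes must be declared so the model knows the label space up front
    clf.partial_fit(xi, [yi], classes=[0, 1])

acc = correct / len(X_stream)
print(f"Prequential accuracy: {acc:.3f}")
```

Dedicated streaming libraries such as River offer richer online-learning tooling (adaptive trees, built-in drift detectors) if you need more than linear models.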

# Online Learning Framework
class OnlineLearner(ABC):
    """Abstract base class for online learning."""
    
    @abstractmethod
    def partial_fit(self, X: np.ndarray, y: np.ndarray):
        pass
    
    @abstractmethod
    def predict(self, X: np.ndarray) -> np.ndarray:
        pass


class OnlineSGDClassifier(OnlineLearner):
    """Online SGD classifier with adaptive learning."""
    
    def __init__(self, n_features: int, learning_rate: float = 0.01,
                 l2_reg: float = 0.001):
        self.n_features = n_features
        self.learning_rate = learning_rate
        self.l2_reg = l2_reg
        
        # Initialize weights
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        
        # Adaptive learning rate (AdaGrad)
        self.grad_squared = np.zeros(n_features)
        self.bias_grad_squared = 0.0
        
        self.n_samples_seen = 0
        
    def _sigmoid(self, z: np.ndarray) -> np.ndarray:
        """Sigmoid activation."""
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    def partial_fit(self, X: np.ndarray, y: np.ndarray):
        """Update model with new samples."""
        X = np.atleast_2d(X)
        y = np.atleast_1d(y)
        
        for xi, yi in zip(X, y):
            # Forward pass
            z = np.dot(xi, self.weights) + self.bias
            pred = self._sigmoid(z)
            
            # Gradient
            error = pred - yi
            grad_w = error * xi + self.l2_reg * self.weights
            grad_b = error
            
            # AdaGrad update
            self.grad_squared += grad_w ** 2
            self.bias_grad_squared += grad_b ** 2
            
            adj_lr_w = self.learning_rate / (np.sqrt(self.grad_squared) + 1e-8)
            adj_lr_b = self.learning_rate / (np.sqrt(self.bias_grad_squared) + 1e-8)
            
            # Update weights
            self.weights -= adj_lr_w * grad_w
            self.bias -= adj_lr_b * grad_b
            
            self.n_samples_seen += 1
    
    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Predict probabilities."""
        X = np.atleast_2d(X)
        z = np.dot(X, self.weights) + self.bias
        return self._sigmoid(z)
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        return (self.predict_proba(X) >= 0.5).astype(int)

print("OnlineSGDClassifier defined")
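The adaptive step inside `partial_fit` above is AdaGrad: each weight accumulates its squared gradients and divides the base learning rate by their root, so frequently-updated weights take progressively smaller steps:

```latex
G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^2,
\qquad
w_{t+1,i} = w_{t,i} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon} \, g_{t,i}
```

Here η is `learning_rate` and ε is the `1e-8` stabilizer in the code.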
# Online Learning with Concept Drift Detection
class DriftDetector:
    """Detect concept drift in streaming data."""
    
    def __init__(self, window_size: int = 100, threshold: float = 0.1):
        self.window_size = window_size
        self.threshold = threshold
        self.error_window = deque(maxlen=window_size)
        self.baseline_error = None
        
    def update(self, is_error: bool):
        """Update with new prediction result."""
        self.error_window.append(1 if is_error else 0)
        
        if len(self.error_window) == self.window_size:
            if self.baseline_error is None:
                self.baseline_error = np.mean(self.error_window)
    
    def is_drift(self) -> bool:
        """Check if drift has occurred."""
        if self.baseline_error is None or len(self.error_window) < self.window_size:
            return False
        
        current_error = np.mean(self.error_window)
        return (current_error - self.baseline_error) > self.threshold
    
    def reset_baseline(self):
        """Reset baseline after handling drift."""
        if len(self.error_window) >= self.window_size:
            self.baseline_error = np.mean(self.error_window)


class AdaptiveOnlineLearner:
    """Online learner with drift detection and adaptation."""
    
    def __init__(self, n_features: int, learning_rate: float = 0.01):
        self.model = OnlineSGDClassifier(n_features, learning_rate)
        self.drift_detector = DriftDetector(window_size=50, threshold=0.15)
        
        self.predictions = []
        self.actuals = []
        self.drift_points = []
        
    def predict_and_update(self, X: np.ndarray, y: int) -> int:
        """Make prediction and update model."""
        # Predict
        pred = self.model.predict(X.reshape(1, -1))[0]
        
        # Record
        self.predictions.append(pred)
        self.actuals.append(y)
        
        # Check for drift
        is_error = pred != y
        self.drift_detector.update(is_error)
        
        if self.drift_detector.is_drift():
            self.drift_points.append(len(self.predictions))
            # Increase learning rate temporarily
            self.model.learning_rate *= 2
            self.drift_detector.reset_baseline()
        
        # Update model
        self.model.partial_fit(X.reshape(1, -1), np.array([y]))
        
        # Decay learning rate
        self.model.learning_rate = max(0.001, self.model.learning_rate * 0.999)
        
        return pred
    
    def get_rolling_accuracy(self, window: int = 50) -> Optional[float]:
        """Get rolling accuracy."""
        if len(self.predictions) < window:
            return None
        
        recent_pred = self.predictions[-window:]
        recent_actual = self.actuals[-window:]
        
        return np.mean(np.array(recent_pred) == np.array(recent_actual))

print("AdaptiveOnlineLearner defined")
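Before relying on the detector inside the adaptive learner, its threshold logic can be replayed standalone on a synthetic error stream whose rate jumps from roughly 10% to 40% halfway through (the rates here are illustrative):

```python
import numpy as np
from collections import deque

# Replay of the DriftDetector logic: ~10% baseline error rate jumping to ~40%
# at sample 300; drift fires when the rolling mean exceeds baseline + 0.1
rng = np.random.default_rng(0)
errors = np.concatenate([rng.random(300) < 0.10, rng.random(300) < 0.40])

window = deque(maxlen=100)
baseline, drift_at = None, None
for i, is_error in enumerate(errors):
    window.append(int(is_error))
    if len(window) == 100 and baseline is None:
        baseline = np.mean(window)     # freeze baseline once the window fills
    if baseline is not None and drift_at is None and (np.mean(window) - baseline) > 0.1:
        drift_at = i

print(f"baseline error: {baseline:.2f}, drift flagged at sample {drift_at}")
```

The flag fires shortly after the true change point, with a lag on the order of the window size.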
# Test online learning with simulated concept drift
def generate_drift_data(n_samples=1000, drift_point=500):
    """Generate data with concept drift."""
    np.random.seed(42)
    
    X = np.random.randn(n_samples, 5)
    
    # Before drift: y depends on X[:, 0] + X[:, 1]
    y_before = ((X[:drift_point, 0] + X[:drift_point, 1]) > 0).astype(int)
    
    # After drift: y depends on X[:, 2] - X[:, 3]
    y_after = ((X[drift_point:, 2] - X[drift_point:, 3]) > 0).astype(int)
    
    y = np.concatenate([y_before, y_after])
    
    return X, y

X, y = generate_drift_data(1000, drift_point=500)

# Train adaptive online learner
online_learner = AdaptiveOnlineLearner(n_features=5, learning_rate=0.1)

rolling_accuracies = []

print("Training adaptive online learner...")
for i in range(len(X)):
    online_learner.predict_and_update(X[i], y[i])
    
    if i >= 50:
        acc = online_learner.get_rolling_accuracy(50)
        rolling_accuracies.append(acc)

print(f"\nDrift detected at points: {online_learner.drift_points}")
print(f"Final rolling accuracy: {rolling_accuracies[-1]:.4f}")
# Visualize online learning
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Rolling accuracy
axes[0].plot(range(50, len(X)), rolling_accuracies)
axes[0].axvline(x=500, color='red', linestyle='--', label='True Drift Point')
for dp in online_learner.drift_points:
    axes[0].axvline(x=dp, color='green', linestyle=':', alpha=0.7)
axes[0].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
axes[0].set_xlabel('Sample')
axes[0].set_ylabel('Rolling Accuracy (50)')
axes[0].set_title('Online Learning Accuracy with Concept Drift')
axes[0].legend()

# Cumulative accuracy
cumulative_correct = np.cumsum(np.array(online_learner.predictions) == np.array(online_learner.actuals))
cumulative_accuracy = cumulative_correct / np.arange(1, len(cumulative_correct) + 1)
axes[1].plot(cumulative_accuracy)
axes[1].axvline(x=500, color='red', linestyle='--', label='Drift Point')
axes[1].set_xlabel('Sample')
axes[1].set_ylabel('Cumulative Accuracy')
axes[1].set_title('Cumulative Accuracy Over Time')
axes[1].legend()

plt.tight_layout()
plt.show()

Section 3: Advanced Ensemble Methods

Advanced ensembles go beyond simple averaging to dynamically weight models based on recent performance.
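The decayed-accuracy weighting scheme used below can be checked numerically first. With decay 0.95, three hypothetical models that are right 80%, 60%, and 40% of the time should leave the most accurate model with the largest normalized weight:

```python
import numpy as np

decay, min_weight = 0.95, 0.1
model_correct = np.zeros(3)
model_total = np.zeros(3)

rng = np.random.default_rng(1)
hit_rates = np.array([0.8, 0.6, 0.4])   # hypothetical per-model hit rates
for _ in range(200):
    correct = rng.random(3) < hit_rates
    # Exponentially decay past performance so recent results dominate
    model_correct = decay * model_correct + correct
    model_total = decay * model_total + 1

accuracies = np.maximum(model_correct / (model_total + 1e-8), min_weight)
weights = accuracies / accuracies.sum()
print(np.round(weights, 3))
```

The decay gives an effective memory of roughly 1/(1-decay) = 20 recent predictions, which is what lets the ensemble re-weight quickly after a regime change.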

# Dynamic Weighted Ensemble
class DynamicEnsemble:
    """Ensemble that dynamically adjusts weights based on performance."""
    
    def __init__(self, models: List, weight_decay: float = 0.95,
                 min_weight: float = 0.1):
        self.models = models
        self.n_models = len(models)
        self.weight_decay = weight_decay
        self.min_weight = min_weight
        
        # Initialize equal weights
        self.weights = np.ones(self.n_models) / self.n_models
        
        # Track performance
        self.model_correct = np.zeros(self.n_models)
        self.model_total = np.zeros(self.n_models)
        
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Make weighted ensemble prediction."""
        predictions = np.array([m.predict(X) for m in self.models])
        
        # Weighted voting
        weighted_sum = np.dot(self.weights, predictions)
        return (weighted_sum >= 0.5).astype(int)
    
    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Get weighted probability."""
        probas = np.array([m.predict_proba(X)[:, 1] if hasattr(m, 'predict_proba') 
                          else m.predict(X) for m in self.models])
        return np.dot(self.weights, probas)
    
    def update_weights(self, X: np.ndarray, y_true: int):
        """Update weights based on individual model performance."""
        # Get individual predictions
        predictions = np.array([m.predict(X.reshape(1, -1))[0] for m in self.models])
        
        # Update performance tracking
        correct = predictions == y_true
        self.model_correct = self.weight_decay * self.model_correct + correct
        self.model_total = self.weight_decay * self.model_total + 1
        
        # Calculate new weights based on accuracy
        accuracies = self.model_correct / (self.model_total + 1e-8)
        
        # Ensure minimum weight
        accuracies = np.maximum(accuracies, self.min_weight)
        
        # Normalize weights
        self.weights = accuracies / accuracies.sum()
    
    def get_model_weights(self) -> Dict[int, float]:
        """Get current model weights."""
        return {i: w for i, w in enumerate(self.weights)}

print("DynamicEnsemble defined")
# Stacking Ensemble with Meta-Learner
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import StandardScaler

class StackingEnsemble:
    """Stacking ensemble with meta-learner."""
    
    def __init__(self, base_models: List, meta_model=None):
        self.base_models = base_models
        self.meta_model = meta_model or LogisticRegression()
        self.scaler = StandardScaler()
        self.fitted = False
        
    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit stacking ensemble."""
        # Scale features
        X_scaled = self.scaler.fit_transform(X)
        
        # Generate base model predictions using cross-validation
        meta_features = np.zeros((len(X), len(self.base_models)))
        
        for i, model in enumerate(self.base_models):
            # Get out-of-fold predictions
            meta_features[:, i] = cross_val_predict(
                model, X_scaled, y, cv=5, method='predict'
            )
            # Fit on full data
            model.fit(X_scaled, y)
        
        # Fit meta-learner
        self.meta_model.fit(meta_features, y)
        self.fitted = True
        
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Make stacked prediction."""
        if not self.fitted:
            raise ValueError("Model not fitted")
        
        X_scaled = self.scaler.transform(X)
        
        # Get base model predictions
        meta_features = np.column_stack([
            model.predict(X_scaled) for model in self.base_models
        ])
        
        # Meta-learner prediction
        return self.meta_model.predict(meta_features)
    
    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Get probability predictions."""
        if not self.fitted:
            raise ValueError("Model not fitted")
        
        X_scaled = self.scaler.transform(X)
        
        meta_features = np.column_stack([
            model.predict(X_scaled) for model in self.base_models
        ])
        
        return self.meta_model.predict_proba(meta_features)

print("StackingEnsemble defined")
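For production use, scikit-learn ships a ready-made equivalent: `StackingClassifier` also trains the meta-learner on out-of-fold base predictions. A sketch mirroring the hand-rolled class above (data and model choices here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('lr', LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner
    cv=5,                                  # out-of-fold base predictions, as above
)
stack.fit(X_tr, y_tr)
acc = accuracy_score(y_te, stack.predict(X_te))
print(f"StackingClassifier accuracy: {acc:.3f}")
```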
# Test stacking ensemble
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate data
X_class, y_class = make_classification(n_samples=1000, n_features=20,
                                        n_informative=10, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42
)

# Create base models
base_models = [
    RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42),
    GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42),
    LogisticRegression(random_state=42)
]

# Train stacking ensemble
stacking = StackingEnsemble(base_models)
stacking.fit(X_train, y_train)

# Evaluate
stacking_pred = stacking.predict(X_test)
stacking_acc = accuracy_score(y_test, stacking_pred)

# Compare with individual models
print("Model Comparison:")
print(f"Stacking Ensemble: {stacking_acc:.4f}")

for i, model in enumerate(base_models):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    model.fit(X_train_scaled, y_train)
    pred = model.predict(X_test_scaled)
    acc = accuracy_score(y_test, pred)
    print(f"Model {i} ({type(model).__name__}): {acc:.4f}")

Section 4: Meta-Learning for Finance

Meta-learning aims to learn how to learn, enabling quick adaptation to new market regimes.

# Model Selection Meta-Learner
class ModelSelector:
    """Meta-learner that selects best model based on market conditions."""
    
    def __init__(self, models: Dict[str, Any]):
        self.models = models
        self.model_performance = {name: deque(maxlen=50) for name in models}
        self.regime_model_map = {}  # Maps regime to best model
        
    def detect_regime(self, features: np.ndarray) -> str:
        """Detect current market regime."""
        # Simple regime detection based on volatility and trend
        volatility = np.std(features[-20:, 0]) if len(features) >= 20 else 0.01
        trend = np.mean(features[-10:, 0]) if len(features) >= 10 else 0
        
        if volatility > 0.02:
            return 'high_volatility'
        elif trend > 0.001:
            return 'bullish'
        elif trend < -0.001:
            return 'bearish'
        else:
            return 'sideways'
    
    def select_model(self, features: np.ndarray) -> str:
        """Select best model for current conditions."""
        regime = self.detect_regime(features)
        
        # Check if we have learned which model works best for this regime
        if regime in self.regime_model_map:
            return self.regime_model_map[regime]
        
        # Otherwise, select model with best recent performance
        best_model = None
        best_accuracy = -1
        
        for name, perf in self.model_performance.items():
            if len(perf) > 0:
                acc = np.mean(perf)
                if acc > best_accuracy:
                    best_accuracy = acc
                    best_model = name
        
        return best_model or list(self.models.keys())[0]
    
    def predict(self, X: np.ndarray, features_history: np.ndarray) -> np.ndarray:
        """Make prediction using selected model."""
        model_name = self.select_model(features_history)
        return self.models[model_name].predict(X)
    
    def update_performance(self, model_name: str, is_correct: bool,
                            features: np.ndarray):
        """Update model performance tracking."""
        self.model_performance[model_name].append(1 if is_correct else 0)
        
        # Update regime-model mapping
        regime = self.detect_regime(features)
        
        # Find best model for this regime
        best_acc = -1
        best_model = None
        
        for name, perf in self.model_performance.items():
            if len(perf) >= 10:
                acc = np.mean(list(perf)[-10:])
                if acc > best_acc:
                    best_acc = acc
                    best_model = name
        
        if best_model:
            self.regime_model_map[regime] = best_model

print("ModelSelector defined")
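The threshold rules in `detect_regime` can be exercised standalone. The thresholds below are copied from the class; the synthetic feature histories are illustrative:

```python
import numpy as np

def detect_regime(features, vol_threshold=0.02, trend_threshold=0.001):
    """Same volatility/trend rules as ModelSelector.detect_regime."""
    volatility = np.std(features[-20:, 0]) if len(features) >= 20 else 0.01
    trend = np.mean(features[-10:, 0]) if len(features) >= 10 else 0
    if volatility > vol_threshold:
        return 'high_volatility'
    elif trend > trend_threshold:
        return 'bullish'
    elif trend < -trend_threshold:
        return 'bearish'
    return 'sideways'

rng = np.random.default_rng(0)
calm_up = rng.normal(0.005, 0.005, size=(30, 1))   # small positive drift, low vol
choppy = rng.normal(0.0, 0.05, size=(30, 1))       # no drift, high vol
print(detect_regime(calm_up), detect_regime(choppy))
```

Note that volatility is checked first, so a strongly trending but volatile market is still labeled `high_volatility`.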
# Regime-Specific Model Training
class RegimeAdaptiveSystem:
    """System that adapts model based on detected regime."""
    
    def __init__(self):
        self.regime_models = {
            'high_volatility': RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42),
            'bullish': GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42),
            'bearish': GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42),
            'sideways': LogisticRegression(random_state=42)
        }
        # One scaler per regime: a single shared scaler would be refit on whichever
        # regime retrained last, corrupting inputs for every other regime's model
        self.regime_scalers = {regime: StandardScaler() for regime in self.regime_models}
        self.regime_data = {regime: {'X': [], 'y': []} for regime in self.regime_models}
        self.trained_regimes = set()
        self.min_samples = 50

    def detect_regime(self, df: pd.DataFrame) -> str:
        """Detect market regime from price data."""
        if len(df) < 20:
            return 'sideways'

        returns = df['close'].pct_change().dropna()
        volatility = returns.std()
        trend = returns.mean()

        if volatility > 0.02:
            return 'high_volatility'
        elif trend > 0.001:
            return 'bullish'
        elif trend < -0.001:
            return 'bearish'
        else:
            return 'sideways'

    def add_sample(self, X: np.ndarray, y: int, regime: str):
        """Add training sample to regime-specific data."""
        self.regime_data[regime]['X'].append(X)
        self.regime_data[regime]['y'].append(y)

        # Retrain once enough samples have accumulated
        if len(self.regime_data[regime]['X']) >= self.min_samples:
            self._retrain_regime(regime)

    def _retrain_regime(self, regime: str):
        """Retrain the model (and its scaler) for a specific regime."""
        X = np.array(self.regime_data[regime]['X'])
        y = np.array(self.regime_data[regime]['y'])

        X_scaled = self.regime_scalers[regime].fit_transform(X)
        self.regime_models[regime].fit(X_scaled, y)
        self.trained_regimes.add(regime)

    def predict(self, X: np.ndarray, df: pd.DataFrame) -> int:
        """Predict using the regime-appropriate model."""
        regime = self.detect_regime(df)

        # Fall back to any trained regime if this one lacks data
        if regime not in self.trained_regimes:
            if not self.trained_regimes:
                raise ValueError("No regime model has been trained yet")
            regime = next(iter(self.trained_regimes))

        X_scaled = self.regime_scalers[regime].transform(X.reshape(1, -1))
        return self.regime_models[regime].predict(X_scaled)[0]

print("RegimeAdaptiveSystem defined")

Section 5: Module Project - Advanced Trading System

Build an advanced trading system combining RL, online learning, and ensemble methods.

# Advanced ML Trading System
class AdvancedMLTradingSystem:
    """Trading system combining multiple advanced ML techniques."""
    
    def __init__(self, initial_capital: float = 100000):
        self.initial_capital = initial_capital
        self.capital = initial_capital
        self.position = 0
        
        # Components
        self.scaler = StandardScaler()
        self.online_learner = None
        self.ensemble = None
        self.regime_system = RegimeAdaptiveSystem()
        
        # Tracking
        self.equity_curve = []
        self.signals = []
        self.regime_history = []
        
    def create_features(self, df: pd.DataFrame) -> np.ndarray:
        """Create features from price data."""
        data = df.copy()
        
        # Returns
        data['return_1d'] = data['close'].pct_change()
        data['return_5d'] = data['close'].pct_change(5)
        
        # Volatility
        data['volatility'] = data['return_1d'].rolling(20).std()
        
        # RSI
        delta = data['close'].diff()
        gain = (delta.where(delta > 0, 0)).rolling(14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
        data['rsi'] = 100 - (100 / (1 + gain / (loss + 1e-10)))
        
        # Price to SMA
        data['price_to_sma'] = data['close'] / data['close'].rolling(20).mean()
        
        feature_cols = ['return_1d', 'return_5d', 'volatility', 'rsi', 'price_to_sma']
        return data[feature_cols].fillna(0).values
    
    def initialize_models(self, train_data: pd.DataFrame):
        """Initialize all models with training data."""
        features = self.create_features(train_data)
        
        # Create target
        target = (train_data['close'].shift(-1) > train_data['close']).astype(int).values
        
        # Remove NaN
        valid_idx = ~np.isnan(features).any(axis=1)
        features = features[valid_idx]
        target = target[valid_idx][:-1]  # Remove last (no target)
        features = features[:-1]
        
        # Scale features
        self.scaler.fit(features)
        features_scaled = self.scaler.transform(features)
        
        # Initialize online learner
        self.online_learner = OnlineSGDClassifier(n_features=features.shape[1])
        for i in range(min(100, len(features))):
            self.online_learner.partial_fit(
                features_scaled[i:i+1], target[i:i+1]
            )
        
        # Initialize ensemble
        base_models = [
            RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42),
            GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42),
            LogisticRegression(random_state=42)
        ]
        
        for model in base_models:
            model.fit(features_scaled, target)
        
        self.ensemble = DynamicEnsemble(base_models)
        
        print("Models initialized")
    
    def generate_signal(self, current_data: pd.DataFrame) -> int:
        """Generate trading signal using ensemble of methods."""
        features = self.create_features(current_data)
        X = features[-1:]
        
        if np.isnan(X).any():
            return 0
        
        X_scaled = self.scaler.transform(X)
        
        # Get predictions from different methods
        online_pred = self.online_learner.predict(X_scaled)[0]
        ensemble_pred = self.ensemble.predict(X_scaled)[0]
        regime_pred = self.regime_system.predict(X_scaled[0], current_data)
        
        # Combine predictions (majority voting)
        votes = [online_pred, ensemble_pred, regime_pred]
        final_pred = 1 if sum(votes) >= 2 else 0
        
        return 1 if final_pred == 1 else -1
    
    def update_models(self, X: np.ndarray, y: int, df: pd.DataFrame):
        """Update models with new observation."""
        X_scaled = self.scaler.transform(X.reshape(1, -1))
        
        # Update online learner
        self.online_learner.partial_fit(X_scaled, np.array([y]))
        
        # Update ensemble weights
        self.ensemble.update_weights(X_scaled[0], y)
        
        # Update regime system
        regime = self.regime_system.detect_regime(df)
        self.regime_system.add_sample(X_scaled[0], y, regime)
    
    def trade(self, signal: int, price: float):
        """Execute trade based on signal."""
        if signal != self.position:
            # Closing the old position and opening the new one costs
            # 0.1% of capital per unit of position change
            cost = abs(signal - self.position) * self.capital * 0.001
            self.capital -= cost
            self.position = signal
    
    def update_pnl(self, price_return: float):
        """Update P&L based on position."""
        pnl = self.capital * self.position * price_return
        self.capital += pnl
    
    def run_backtest(self, data: pd.DataFrame, lookback: int = 100):
        """Run backtest on historical data."""
        # Initialize with first portion
        self.initialize_models(data.iloc[:lookback])
        
        self.equity_curve = [self.initial_capital]
        
        for i in range(lookback, len(data) - 1):
            current_data = data.iloc[i-lookback:i+1]
            
            # Generate signal
            signal = self.generate_signal(current_data)
            self.signals.append(signal)
            
            # Record regime
            regime = self.regime_system.detect_regime(current_data)
            self.regime_history.append(regime)
            
            # Trade
            current_price = data.iloc[i]['close']
            self.trade(signal, current_price)
            
            # Update P&L
            next_price = data.iloc[i+1]['close']
            price_return = (next_price - current_price) / current_price
            self.update_pnl(price_return)
            
            self.equity_curve.append(self.capital)
            
            # Update models with actual outcome
            actual = 1 if next_price > current_price else 0
            features = self.create_features(current_data)
            if not np.isnan(features[-1]).any():
                self.update_models(features[-1], actual, current_data)
        
        return self.get_results()
    
    def get_results(self) -> Dict:
        """Get backtest results."""
        capitals = np.array(self.equity_curve)
        returns = np.diff(capitals) / capitals[:-1]
        
        return {
            'total_return': (self.capital / self.initial_capital) - 1,
            'sharpe_ratio': np.sqrt(252) * np.mean(returns) / (np.std(returns) + 1e-8),
            'max_drawdown': (capitals / np.maximum.accumulate(capitals) - 1).min(),
            'equity_curve': self.equity_curve,
            'signals': self.signals,
            'regime_history': self.regime_history
        }

print("AdvancedMLTradingSystem defined")
# Run advanced system backtest
# Generate longer dataset
full_data = generate_trading_data(2000)

# Initialize and run system
advanced_system = AdvancedMLTradingSystem(initial_capital=100000)
results = advanced_system.run_backtest(full_data, lookback=100)

print("\n" + "="*50)
print("Advanced ML Trading System Results")
print("="*50)
print(f"Total Return: {results['total_return']:.2%}")
print(f"Sharpe Ratio: {results['sharpe_ratio']:.2f}")
print(f"Max Drawdown: {results['max_drawdown']:.2%}")
# Visualize advanced system results
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Equity curve
axes[0, 0].plot(results['equity_curve'], label='Strategy')
# Rebase buy & hold to the strategy's start (the backtest begins at step `lookback`)
buy_hold = 100000 * (full_data['close'].iloc[100:] / full_data['close'].iloc[100])
axes[0, 0].plot(buy_hold.values, alpha=0.7, label='Buy & Hold')
axes[0, 0].set_xlabel('Step')
axes[0, 0].set_ylabel('Capital ($)')
axes[0, 0].set_title('Advanced System vs Buy & Hold')
axes[0, 0].legend()

# Drawdown
capitals = np.array(results['equity_curve'])
peak = np.maximum.accumulate(capitals)
drawdown = (capitals - peak) / peak
axes[0, 1].fill_between(range(len(drawdown)), drawdown, 0, alpha=0.7, color='red')
axes[0, 1].set_xlabel('Step')
axes[0, 1].set_ylabel('Drawdown')
axes[0, 1].set_title('Drawdown')

# Regime distribution
from collections import Counter
regime_counts = Counter(results['regime_history'])
axes[1, 0].bar(regime_counts.keys(), regime_counts.values())
axes[1, 0].set_xlabel('Regime')
axes[1, 0].set_ylabel('Count')
axes[1, 0].set_title('Regime Distribution')
axes[1, 0].tick_params(axis='x', rotation=45)

# Signal distribution
signal_counts = Counter(results['signals'])
axes[1, 1].bar(['Short (-1)', 'Long (1)'], 
               [signal_counts.get(-1, 0), signal_counts.get(1, 0)])
axes[1, 1].set_xlabel('Signal')
axes[1, 1].set_ylabel('Count')
axes[1, 1].set_title('Signal Distribution')

plt.tight_layout()
plt.show()

Exercises

Complete the following exercises to practice advanced ML techniques.

Exercise 14.1: Implement Reward Shaping (Guided)

Create a custom reward function for RL trading.

Solution 14.1
def calculate_shaped_reward(position: int, price_return: float,
                            volatility: float, drawdown: float) -> float:
    # Calculate base return reward
    return_reward = position * price_return

    # Calculate risk-adjusted reward
    risk_adjusted = return_reward / (volatility + 1e-8)

    # Calculate drawdown penalty
    drawdown_penalty = -0.5 * abs(drawdown) if drawdown < -0.05 else 0

    # Combine rewards
    total_reward = risk_adjusted + drawdown_penalty

    return total_reward

Exercise 14.2: Implement Online Weight Update (Guided)

Create a function to update ensemble weights online.

Solution 14.2
def update_ensemble_weights(weights: np.ndarray, predictions: np.ndarray,
                             actual: int, learning_rate: float = 0.1) -> np.ndarray:
    # Calculate which models were correct
    correct = (predictions == actual).astype(float)

    # Calculate reward/penalty for each model
    rewards = np.where(correct == 1, learning_rate, -learning_rate)

    # Update weights using exponential update
    new_weights = weights * np.exp(rewards)

    # Normalize weights to sum to 1
    new_weights = new_weights / new_weights.sum()

    return new_weights

Exercise 14.3: Implement Regime Detection (Guided)

Create a market regime detector.

Solution 14.3
def detect_market_regime(prices: np.ndarray, window: int = 20) -> str:
    if len(prices) < window:
        return 'unknown'

    # Calculate returns
    returns = np.diff(prices) / prices[:-1]
    recent_returns = returns[-window:]

    # Calculate trend (average return)
    trend = np.mean(recent_returns)

    # Calculate volatility
    volatility = np.std(recent_returns)

    # Classify regime
    if volatility > 0.02:
        return 'high_volatility'
    elif trend > 0.001:
        return 'trending_up'
    elif trend < -0.001:
        return 'trending_down'
    else:
        return 'low_volatility'

Exercise 14.4: Build Experience Replay Buffer (Open-ended)

Create an experience replay buffer for deep RL.

Solution 14.4
class ReplayBuffer:
    def __init__(self, capacity: int = 10000):
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done, priority=1.0):
        """Add experience to buffer."""
        experience = (state, action, reward, next_state, done)
        self.buffer.append(experience)
        self.priorities.append(priority)

    def sample(self, batch_size: int) -> List[Tuple]:
        """Sample random batch."""
        indices = np.random.choice(len(self.buffer), 
                                   size=min(batch_size, len(self.buffer)),
                                   replace=False)
        return [self.buffer[i] for i in indices]

    def sample_prioritized(self, batch_size: int, alpha: float = 0.6) -> List[Tuple]:
        """Sample with priority weighting."""
        priorities = np.array(self.priorities) ** alpha
        probs = priorities / priorities.sum()

        indices = np.random.choice(len(self.buffer),
                                   size=min(batch_size, len(self.buffer)),
                                   p=probs,
                                   replace=False)
        return [self.buffer[i] for i in indices]

    def update_priority(self, index: int, priority: float):
        """Update priority for an experience."""
        if index < len(self.priorities):
            self.priorities[index] = priority

    def __len__(self):
        return len(self.buffer)

# Usage
buffer = ReplayBuffer(capacity=10000)
buffer.push(np.zeros(5), 1, 0.1, np.zeros(5), False)
print(f"Buffer size: {len(buffer)}")

Exercise 14.5: Implement Bandit-Based Model Selection (Open-ended)

Create a multi-armed bandit for dynamic model selection.

Solution 14.5
class ModelBandit:
    def __init__(self, n_models: int, exploration_param: float = 2.0):
        self.n_models = n_models
        self.exploration_param = exploration_param

        # Track performance
        self.n_selections = np.zeros(n_models)
        self.total_rewards = np.zeros(n_models)
        self.total_rounds = 0

    def select_model(self) -> int:
        """Select model using UCB algorithm."""
        self.total_rounds += 1

        # Ensure each model is tried at least once
        for i in range(self.n_models):
            if self.n_selections[i] == 0:
                return i

        # Calculate UCB scores
        avg_rewards = self.total_rewards / self.n_selections
        exploration_bonus = np.sqrt(
            self.exploration_param * np.log(self.total_rounds) / self.n_selections
        )
        ucb_scores = avg_rewards + exploration_bonus

        return np.argmax(ucb_scores)

    def update(self, model_idx: int, reward: float):
        """Update model statistics."""
        self.n_selections[model_idx] += 1
        self.total_rewards[model_idx] += reward

    def get_best_model(self) -> int:
        """Get best performing model."""
        avg_rewards = self.total_rewards / (self.n_selections + 1e-8)
        return np.argmax(avg_rewards)

    def get_model_stats(self) -> pd.DataFrame:
        """Get statistics for all models."""
        return pd.DataFrame({
            'model': range(self.n_models),
            'n_selections': self.n_selections,
            'total_rewards': self.total_rewards,
            'avg_reward': self.total_rewards / (self.n_selections + 1e-8)
        })

# Usage
bandit = ModelBandit(n_models=3)
for _ in range(100):
    model_idx = bandit.select_model()
    reward = np.random.random()  # Simulated reward
    bandit.update(model_idx, reward)
print(bandit.get_model_stats())

Exercise 14.6: Create Adaptive Learning Rate Scheduler (Open-ended)

Build a learning rate scheduler that adapts to market conditions.

Solution 14.6
class MarketAdaptiveLRScheduler:
    def __init__(self, initial_lr: float = 0.01,
                 min_lr: float = 0.0001,
                 max_lr: float = 0.1,
                 warmup_steps: int = 100):
        self.initial_lr = initial_lr
        self.min_lr = min_lr
        self.max_lr = max_lr
        self.warmup_steps = warmup_steps

        self.current_lr = initial_lr
        self.step_count = 0
        self.volatility_history = deque(maxlen=50)
        self.baseline_volatility = None

    def step(self, volatility: float) -> float:
        """Update and return learning rate."""
        self.step_count += 1
        self.volatility_history.append(volatility)

        # Warmup phase
        if self.step_count <= self.warmup_steps:
            warmup_factor = self.step_count / self.warmup_steps
            self.current_lr = self.initial_lr * warmup_factor
            return self.current_lr

        # Set baseline after warmup
        if self.baseline_volatility is None:
            self.baseline_volatility = np.mean(self.volatility_history)

        # Adapt based on volatility
        current_vol = np.mean(list(self.volatility_history)[-10:])
        vol_ratio = current_vol / (self.baseline_volatility + 1e-8)

        # Higher volatility -> higher learning rate (faster adaptation)
        if vol_ratio > 1.5:  # High volatility
            self.current_lr = min(self.current_lr * 1.1, self.max_lr)
        elif vol_ratio < 0.7:  # Low volatility
            self.current_lr = max(self.current_lr * 0.95, self.min_lr)

        return self.current_lr

    def get_lr(self) -> float:
        return self.current_lr

    def reset(self):
        self.current_lr = self.initial_lr
        self.step_count = 0
        self.volatility_history.clear()
        self.baseline_volatility = None

# Usage
scheduler = MarketAdaptiveLRScheduler()
for i in range(200):
    vol = 0.01 if i < 100 else 0.03  # Volatility spike at step 100
    lr = scheduler.step(vol)
    if i % 50 == 0:
        print(f"Step {i}: LR = {lr:.6f}")

Summary

In this module, you learned:

  1. Reinforcement Learning: Using RL to learn trading policies through interaction with market environments

  2. Online Learning: Continuously adapting models to new data with drift detection

  3. Advanced Ensembles: Dynamic weighting and stacking for improved predictions

  4. Meta-Learning: Learning which models work best in different market conditions

  5. Production Systems: Combining multiple techniques into robust trading systems

Key Takeaways

  • RL can discover trading strategies that supervised learning might miss
  • Online learning enables continuous adaptation without full retraining
  • Dynamic ensembles outperform static ensembles in changing markets
  • Meta-learning helps select the right model for current conditions
  • Combining multiple techniques provides the most robust results

Course Completion

Congratulations! You have completed the Machine Learning for Financial Markets course. You now have a comprehensive understanding of applying ML techniques to trading and investment problems.

Capstone Project: End-to-End ML Trading System

Project Overview

In this capstone project, you will build a complete machine learning trading system from scratch. This project integrates all concepts from the course including data preprocessing, feature engineering, model selection, backtesting, and production deployment.

Learning Objectives

By completing this project, you will demonstrate:

  • Comprehensive feature engineering for financial data
  • Multiple ML model implementation and comparison
  • Proper walk-forward backtesting methodology
  • Production-ready system architecture
  • Performance analysis and risk management

Project Requirements

Build a trading system that:

  1. Processes raw market data into ML-ready features
  2. Trains and evaluates multiple model types
  3. Implements proper walk-forward validation
  4. Includes realistic transaction costs
  5. Provides comprehensive performance analysis
  6. Is designed for production deployment
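Requirement 3 calls for walk-forward validation: train on one window of history, test on the window that follows, then roll both forward. A minimal sketch of the split generation (the window sizes below are illustrative, not prescribed by the project):

```python
import numpy as np

def walk_forward_splits(n_samples: int, train_size: int, test_size: int):
    """Yield successive (train_idx, test_idx) windows that move forward in time."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = np.arange(start, start + train_size)
        test_idx = np.arange(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += test_size  # advance by one test window

# Example: 1000 samples, 500-day train window, 100-day test window -> 5 splits
for train_idx, test_idx in walk_forward_splits(1000, 500, 100):
    print(f"train {train_idx[0]}-{train_idx[-1]}, test {test_idx[0]}-{test_idx[-1]}")
```

Test data always lies strictly after the training window, so no future information leaks into model fitting.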

Estimated Time: 6-8 hours


Part 1: Setup and Data Generation

Set up the project environment and generate realistic market data.

# Import all required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from typing import Dict, List, Tuple, Any, Optional
from dataclasses import dataclass, field
from collections import deque
from abc import ABC, abstractmethod
import warnings
warnings.filterwarnings('ignore')

# ML libraries
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report

np.random.seed(42)
print("All libraries loaded successfully")
print(f"Project started: {datetime.now()}")
# Generate comprehensive market data
def generate_market_data(n_days=2500, n_assets=3):
    """
    Generate realistic multi-asset market data with:
    - Regime switching
    - Volatility clustering
    - Cross-asset correlations
    """
    np.random.seed(42)
    
    dates = pd.date_range(start='2015-01-01', periods=n_days, freq='D')
    
    # Create regime series
    regime = np.zeros(n_days)
    current_regime = 0
    for i in range(n_days):
        if np.random.random() < 0.005:  # 0.5% chance to switch
            current_regime = 1 - current_regime
        regime[i] = current_regime
    
    assets = {}
    asset_names = ['STOCK_A', 'STOCK_B', 'STOCK_C'][:n_assets]
    
    # Correlation matrix
    correlation = np.array([
        [1.0, 0.6, 0.3],
        [0.6, 1.0, 0.4],
        [0.3, 0.4, 1.0]
    ])[:n_assets, :n_assets]
    
    # Generate correlated returns
    L = np.linalg.cholesky(correlation)
    uncorr_returns = np.random.normal(0, 1, (n_days, n_assets))
    corr_returns = uncorr_returns @ L.T
    
    for idx, asset in enumerate(asset_names):
        # Base parameters vary by regime
        base_return = np.where(regime == 0, 0.0004, -0.0001)
        base_vol = np.where(regime == 0, 0.012, 0.020)
        
        # Apply volatility clustering
        volatility = np.zeros(n_days)
        volatility[0] = base_vol[0]
        for i in range(1, n_days):
            volatility[i] = 0.9 * volatility[i-1] + 0.1 * base_vol[i]
        
        # Generate returns
        returns = base_return + volatility * corr_returns[:, idx]
        
        # Generate prices
        prices = 100 * np.exp(np.cumsum(returns))
        
        # Create OHLCV
        daily_range = volatility * np.random.uniform(0.5, 1.5, n_days)
        
        assets[asset] = pd.DataFrame({
            'date': dates,
            'open': np.roll(prices, 1),
            'high': prices * (1 + daily_range),
            'low': prices * (1 - daily_range),
            'close': prices,
            'volume': np.random.lognormal(15, 0.5, n_days) * (1 + regime * 0.5),
            'regime': regime
        })
        assets[asset].loc[0, 'open'] = assets[asset].loc[0, 'close']
        assets[asset].set_index('date', inplace=True)
    
    return assets

# Generate data
market_data = generate_market_data(n_days=2500, n_assets=3)

print(f"Generated data for {len(market_data)} assets")
for asset, df in market_data.items():
    print(f"  {asset}: {len(df)} days from {df.index[0].date()} to {df.index[-1].date()}")
# Visualize market data
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Price series
for asset, df in market_data.items():
    axes[0, 0].plot(df.index, df['close'], label=asset)
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Price')
axes[0, 0].set_title('Asset Prices')
axes[0, 0].legend()

# Returns distribution
for asset, df in market_data.items():
    returns = df['close'].pct_change().dropna()
    axes[0, 1].hist(returns, bins=50, alpha=0.5, label=asset)
axes[0, 1].set_xlabel('Daily Return')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Return Distributions')
axes[0, 1].legend()

# Regime over time (first asset)
first_asset = list(market_data.keys())[0]
axes[1, 0].fill_between(market_data[first_asset].index, 
                        market_data[first_asset]['regime'], 
                        alpha=0.5)
axes[1, 0].set_xlabel('Date')
axes[1, 0].set_ylabel('Regime')
axes[1, 0].set_title('Market Regime (0=Bull, 1=Bear)')

# Rolling volatility
for asset, df in market_data.items():
    returns = df['close'].pct_change()
    rolling_vol = returns.rolling(20).std() * np.sqrt(252)
    axes[1, 1].plot(df.index, rolling_vol, label=asset)
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Annualized Volatility')
axes[1, 1].set_title('Rolling 20-Day Volatility')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

Part 2: Feature Engineering Pipeline

Build a comprehensive feature engineering pipeline.

# Feature Engineering Class
class FeatureEngineer:
    """Comprehensive feature engineering for trading."""
    
    def __init__(self):
        self.feature_names = []
        
    def create_price_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create price-based features."""
        data = df.copy()
        
        # Returns at multiple horizons
        for period in [1, 2, 3, 5, 10, 20]:
            data[f'return_{period}d'] = data['close'].pct_change(period)
        
        # Log returns
        data['log_return_1d'] = np.log(data['close'] / data['close'].shift(1))
        
        # Price momentum
        for period in [5, 10, 20, 50]:
            data[f'momentum_{period}d'] = data['close'] / data['close'].shift(period) - 1
        
        return data
    
    def create_volatility_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create volatility-based features."""
        data = df.copy()
        returns = data['close'].pct_change()
        
        # Rolling volatility
        for period in [5, 10, 20, 50]:
            data[f'volatility_{period}d'] = returns.rolling(period).std()
        
        # Volatility ratio
        data['volatility_ratio'] = data['volatility_5d'] / (data['volatility_20d'] + 1e-10)
        
        # Parkinson volatility (using high/low)
        data['parkinson_vol'] = np.sqrt(
            (1 / (4 * np.log(2))) * 
            (np.log(data['high'] / data['low']) ** 2).rolling(20).mean()
        )
        
        # Average True Range
        high_low = data['high'] - data['low']
        high_close = abs(data['high'] - data['close'].shift(1))
        low_close = abs(data['low'] - data['close'].shift(1))
        tr = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1)
        data['atr_14'] = tr.rolling(14).mean()
        data['atr_normalized'] = data['atr_14'] / data['close']
        
        return data
    
    def create_technical_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create technical indicator features."""
        data = df.copy()
        
        # Moving averages
        for period in [5, 10, 20, 50, 200]:
            data[f'sma_{period}'] = data['close'].rolling(period).mean()
            data[f'ema_{period}'] = data['close'].ewm(span=period).mean()
            data[f'price_to_sma_{period}'] = data['close'] / data[f'sma_{period}']
        
        # MACD
        exp12 = data['close'].ewm(span=12).mean()
        exp26 = data['close'].ewm(span=26).mean()
        data['macd'] = exp12 - exp26
        data['macd_signal'] = data['macd'].ewm(span=9).mean()
        data['macd_hist'] = data['macd'] - data['macd_signal']
        
        # RSI
        delta = data['close'].diff()
        gain = (delta.where(delta > 0, 0)).rolling(14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
        rs = gain / (loss + 1e-10)
        data['rsi_14'] = 100 - (100 / (1 + rs))
        
        # Stochastic Oscillator
        low_14 = data['low'].rolling(14).min()
        high_14 = data['high'].rolling(14).max()
        data['stoch_k'] = 100 * (data['close'] - low_14) / (high_14 - low_14 + 1e-10)
        data['stoch_d'] = data['stoch_k'].rolling(3).mean()
        
        # Bollinger Bands
        bb_sma = data['close'].rolling(20).mean()
        bb_std = data['close'].rolling(20).std()
        data['bb_upper'] = bb_sma + 2 * bb_std
        data['bb_lower'] = bb_sma - 2 * bb_std
        data['bb_width'] = (data['bb_upper'] - data['bb_lower']) / bb_sma
        data['bb_position'] = (data['close'] - data['bb_lower']) / (data['bb_upper'] - data['bb_lower'] + 1e-10)
        
        return data
    
    def create_volume_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create volume-based features."""
        data = df.copy()
        
        # Volume moving averages
        for period in [5, 10, 20]:
            data[f'volume_sma_{period}'] = data['volume'].rolling(period).mean()
        
        # Volume ratio
        data['volume_ratio'] = data['volume'] / data['volume_sma_20']
        
        # On-Balance Volume (OBV)
        obv = np.where(data['close'] > data['close'].shift(1), data['volume'],
                       np.where(data['close'] < data['close'].shift(1), -data['volume'], 0))
        data['obv'] = np.cumsum(obv)
        data['obv_sma'] = data['obv'].rolling(20).mean()
        
        # Volume-Price Trend
        data['vpt'] = (data['volume'] * data['close'].pct_change()).cumsum()
        
        return data
    
    def create_all_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create all features."""
        data = df.copy()
        
        data = self.create_price_features(data)
        data = self.create_volatility_features(data)
        data = self.create_technical_features(data)
        data = self.create_volume_features(data)
        
        # Create target (next day direction)
        data['target'] = (data['close'].shift(-1) > data['close']).astype(int)
        data['target_return'] = data['close'].pct_change().shift(-1)
        
        # Store feature names
        exclude_cols = ['open', 'high', 'low', 'close', 'volume', 'regime',
                        'target', 'target_return'] + \
                       [c for c in data.columns if 'sma_' in c and 'price_to' not in c] + \
                       [c for c in data.columns if 'ema_' in c] + \
                       ['bb_upper', 'bb_lower', 'volume_sma_5', 'volume_sma_10', 
                        'volume_sma_20', 'obv', 'obv_sma', 'vpt']
        
        self.feature_names = [c for c in data.columns if c not in exclude_cols]
        
        return data
    
    def get_feature_names(self) -> List[str]:
        """Get list of feature names."""
        return self.feature_names

# Create features for primary asset
feature_engineer = FeatureEngineer()
primary_asset = 'STOCK_A'
df_features = feature_engineer.create_all_features(market_data[primary_asset])

print(f"\nCreated {len(feature_engineer.get_feature_names())} features:")
for i, name in enumerate(feature_engineer.get_feature_names()):
    print(f"  {i+1}. {name}")
# TODO: Complete the feature correlation analysis
# Analyze feature correlations and select the most important features

def analyze_feature_importance(df: pd.DataFrame, feature_names: List[str], 
                                target_col: str = 'target') -> pd.DataFrame:
    """
    Analyze feature importance and correlation with target.
    
    Returns DataFrame with:
    - Feature correlations with target
    - Feature correlations with each other (to detect multicollinearity)
    """
    # YOUR CODE HERE
    # 1. Calculate correlation of each feature with target
    # 2. Identify highly correlated feature pairs
    # 3. Return sorted importance scores
    
    valid_data = df[feature_names + [target_col]].dropna()
    
    # Calculate correlations with target
    target_corr = valid_data[feature_names].corrwith(valid_data[target_col]).abs()
    
    # Create importance DataFrame
    importance = pd.DataFrame({
        'feature': feature_names,
        'target_correlation': target_corr.values
    }).sort_values('target_correlation', ascending=False)
    
    return importance

# Analyze features
feature_importance = analyze_feature_importance(
    df_features, feature_engineer.get_feature_names()
)

print("\nTop 15 Features by Target Correlation:")
print(feature_importance.head(15).to_string(index=False))

Part 3: Model Training and Selection

Train multiple models and select the best one.

# Model Training Framework
class ModelTrainer:
    """Framework for training and comparing models."""
    
    def __init__(self, feature_names: List[str]):
        self.feature_names = feature_names
        self.scaler = StandardScaler()
        self.models = {}
        self.results = {}
        
    def prepare_data(self, df: pd.DataFrame, target_col: str = 'target'):
        """Prepare data for training."""
        # Get features and target
        valid_mask = ~df[self.feature_names + [target_col]].isna().any(axis=1)
        valid_data = df[valid_mask].copy()
        
        X = valid_data[self.feature_names].values
        y = valid_data[target_col].values
        dates = valid_data.index
        
        return X, y, dates
    
    def train_test_split(self, X: np.ndarray, y: np.ndarray, 
                          dates: pd.DatetimeIndex, train_ratio: float = 0.7):
        """Time-based train/test split."""
        split_idx = int(len(X) * train_ratio)
        
        X_train, X_test = X[:split_idx], X[split_idx:]
        y_train, y_test = y[:split_idx], y[split_idx:]
        dates_train, dates_test = dates[:split_idx], dates[split_idx:]
        
        # Scale features
        X_train_scaled = self.scaler.fit_transform(X_train)
        X_test_scaled = self.scaler.transform(X_test)
        
        return (X_train_scaled, X_test_scaled, y_train, y_test, 
                dates_train, dates_test)
    
    def train_model(self, name: str, model, X_train: np.ndarray, 
                    y_train: np.ndarray):
        """Train a model."""
        model.fit(X_train, y_train)
        self.models[name] = model
        return model
    
    def evaluate_model(self, name: str, X_test: np.ndarray, 
                       y_test: np.ndarray) -> Dict:
        """Evaluate a trained model."""
        model = self.models[name]
        predictions = model.predict(X_test)
        
        results = {
            'accuracy': accuracy_score(y_test, predictions),
            'precision': precision_score(y_test, predictions, zero_division=0),
            'recall': recall_score(y_test, predictions, zero_division=0),
            'f1': f1_score(y_test, predictions, zero_division=0),
            'predictions': predictions
        }
        
        self.results[name] = results
        return results
    
    def compare_models(self) -> pd.DataFrame:
        """Compare all trained models."""
        comparison = []
        for name, results in self.results.items():
            comparison.append({
                'model': name,
                'accuracy': results['accuracy'],
                'precision': results['precision'],
                'recall': results['recall'],
                'f1': results['f1']
            })
        return pd.DataFrame(comparison).sort_values('f1', ascending=False)

print("ModelTrainer class defined")
# Train and compare multiple models

# Select top features
top_features = feature_importance.head(20)['feature'].tolist()

# Initialize trainer
trainer = ModelTrainer(top_features)

# Prepare data
X, y, dates = trainer.prepare_data(df_features)

# Split data
(X_train, X_test, y_train, y_test, 
 dates_train, dates_test) = trainer.train_test_split(X, y, dates)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")

# Define models to train
models_to_train = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
}

# Train and evaluate models
print("\nTraining models...")
for name, model in models_to_train.items():
    print(f"  Training {name}...")
    trainer.train_model(name, model, X_train, y_train)
    results = trainer.evaluate_model(name, X_test, y_test)
    print(f"    Accuracy: {results['accuracy']:.4f}, F1: {results['f1']:.4f}")

# Compare models
print("\nModel Comparison:")
comparison = trainer.compare_models()
print(comparison.to_string(index=False))
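The split above trains on a single 70/30 chronological cut. A lighter-weight alternative worth knowing (a sketch on synthetic data, not part of the course framework) is scikit-learn's `TimeSeriesSplit`, which produces several expanding-window folds and never lets training samples follow test samples in time:

```python
# Sketch: multi-fold chronological validation with TimeSeriesSplit.
# X_demo / y_demo are synthetic stand-ins for the real feature matrix.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(20).reshape(-1, 1)  # 20 ordered samples
y_demo = np.arange(20) % 2             # dummy binary target

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_demo)):
    # Training indices always precede test indices, so there is no lookahead
    assert train_idx.max() < test_idx.min()
    print(f"Fold {fold}: train {train_idx.min()}-{train_idx.max()}, "
          f"test {test_idx.min()}-{test_idx.max()}")
```

Each fold's training window grows while the test window slides forward, which is the same idea the walk-forward backtester in Part 4 implements with explicit window sizes.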

Part 4: Walk-Forward Backtesting

Implement proper walk-forward validation.

# Walk-Forward Backtester
class WalkForwardBacktester:
    """Walk-forward backtesting with realistic assumptions."""
    
    def __init__(self, model, feature_names: List[str],
                 train_window: int = 252, test_window: int = 21,
                 step_size: int = 21, commission: float = 0.001):
        self.model = model
        self.feature_names = feature_names
        self.train_window = train_window
        self.test_window = test_window
        self.step_size = step_size
        self.commission = commission
        self.scaler = StandardScaler()
        
        self.fold_results = []
        self.all_predictions = None
        
    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        """Run walk-forward backtest."""
        # Prepare data
        valid_mask = ~df[self.feature_names + ['target']].isna().any(axis=1)
        data = df[valid_mask].copy()
        
        X = data[self.feature_names].values
        y = data['target'].values
        
        n_samples = len(X)
        predictions = np.full(n_samples, np.nan)
        probabilities = np.full(n_samples, np.nan)
        
        start_idx = self.train_window
        fold = 0
        
        while start_idx + self.test_window <= n_samples:
            # Define windows
            train_start = max(0, start_idx - self.train_window)
            train_end = start_idx
            test_start = start_idx
            test_end = min(start_idx + self.test_window, n_samples)
            
            # Get data
            X_train = X[train_start:train_end]
            y_train = y[train_start:train_end]
            X_test = X[test_start:test_end]
            y_test = y[test_start:test_end]
            
            # Scale
            X_train_scaled = self.scaler.fit_transform(X_train)
            X_test_scaled = self.scaler.transform(X_test)
            
            # Train
            self.model.fit(X_train_scaled, y_train)
            
            # Predict
            pred = self.model.predict(X_test_scaled)
            prob = self.model.predict_proba(X_test_scaled)[:, 1]
            
            predictions[test_start:test_end] = pred
            probabilities[test_start:test_end] = prob
            
            # Record fold results
            self.fold_results.append({
                'fold': fold,
                'train_start': data.index[train_start],
                'test_start': data.index[test_start],
                'test_end': data.index[test_end-1],
                'accuracy': accuracy_score(y_test, pred)
            })
            
            fold += 1
            start_idx += self.step_size
        
        # Store results
        results = data.copy()
        results['prediction'] = predictions
        results['probability'] = probabilities
        # Map predictions to positions; keep warm-up rows (no prediction) as NaN
        # so they are excluded by the dropna in calculate_backtest_metrics
        results['signal'] = np.where(
            np.isnan(predictions), np.nan, np.where(predictions == 1, 1, -1)
        )
        
        self.all_predictions = results
        return results
    
    def calculate_backtest_metrics(self, initial_capital: float = 100000) -> Dict:
        """Calculate backtest performance metrics."""
        if self.all_predictions is None:
            raise ValueError("Run backtest first")
        
        results = self.all_predictions.dropna(subset=['signal']).copy()
        
        # Calculate strategy returns
        position = results['signal'].values
        returns = results['target_return'].values
        
        # Account for position changes (transaction costs)
        position_changes = np.abs(np.diff(np.concatenate([[0], position])))
        costs = position_changes * self.commission
        
        # Strategy returns
        strategy_returns = position * returns - costs
        
        # Calculate metrics
        cumulative_returns = (1 + strategy_returns).cumprod()
        
        total_return = cumulative_returns[-1] - 1 if len(cumulative_returns) > 0 else 0
        sharpe = np.sqrt(252) * np.mean(strategy_returns) / (np.std(strategy_returns) + 1e-8)
        
        # Max drawdown
        peak = np.maximum.accumulate(cumulative_returns)
        drawdown = (cumulative_returns - peak) / peak
        max_drawdown = np.min(drawdown)
        
        # Win rate
        win_rate = np.mean(strategy_returns > 0)
        
        # Trade statistics
        n_trades = np.sum(position_changes > 0)
        total_costs = np.sum(costs) * initial_capital
        
        return {
            'total_return': total_return,
            'sharpe_ratio': sharpe,
            'max_drawdown': max_drawdown,
            'win_rate': win_rate,
            'n_trades': n_trades,
            'total_costs': total_costs,
            'avg_fold_accuracy': np.mean([f['accuracy'] for f in self.fold_results]),
            'cumulative_returns': cumulative_returns,
            'strategy_returns': strategy_returns
        }

print("WalkForwardBacktester class defined")
# Run walk-forward backtest

# Select best model from comparison
best_model_name = comparison.iloc[0]['model']
print(f"Using best model: {best_model_name}")

# Create fresh model instance
if 'Random Forest' in best_model_name:
    backtest_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
elif 'Gradient Boosting' in best_model_name:
    backtest_model = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
else:
    backtest_model = LogisticRegression(random_state=42, max_iter=1000)

# Initialize backtester
backtester = WalkForwardBacktester(
    model=backtest_model,
    feature_names=top_features,
    train_window=252,
    test_window=21,
    step_size=21,
    commission=0.001
)

# Run backtest
print("\nRunning walk-forward backtest...")
backtest_results = backtester.run(df_features)

# Calculate metrics
metrics = backtester.calculate_backtest_metrics(initial_capital=100000)

print("\n" + "="*50)
print("WALK-FORWARD BACKTEST RESULTS")
print("="*50)
print(f"Total Return: {metrics['total_return']:.2%}")
print(f"Sharpe Ratio: {metrics['sharpe_ratio']:.2f}")
print(f"Max Drawdown: {metrics['max_drawdown']:.2%}")
print(f"Win Rate: {metrics['win_rate']:.2%}")
print(f"Number of Trades: {metrics['n_trades']:.0f}")
print(f"Total Costs: ${metrics['total_costs']:,.2f}")
print(f"Average Fold Accuracy: {metrics['avg_fold_accuracy']:.4f}")
# Visualize backtest results
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Cumulative returns
valid_results = backtest_results.dropna(subset=['signal'])
strategy_cum = metrics['cumulative_returns']
buy_hold_cum = (1 + valid_results['target_return']).cumprod()

axes[0, 0].plot(range(len(strategy_cum)), strategy_cum, label='Strategy', linewidth=2)
axes[0, 0].plot(range(len(buy_hold_cum)), buy_hold_cum.values, label='Buy & Hold', alpha=0.7)
axes[0, 0].set_xlabel('Trading Days')
axes[0, 0].set_ylabel('Cumulative Return')
axes[0, 0].set_title('Strategy vs Buy & Hold')
axes[0, 0].legend()

# Drawdown
peak = np.maximum.accumulate(strategy_cum)
drawdown = (strategy_cum - peak) / peak
axes[0, 1].fill_between(range(len(drawdown)), drawdown, 0, alpha=0.7, color='red')
axes[0, 1].set_xlabel('Trading Days')
axes[0, 1].set_ylabel('Drawdown')
axes[0, 1].set_title('Strategy Drawdown')

# Fold accuracy over time
fold_df = pd.DataFrame(backtester.fold_results)
axes[1, 0].bar(range(len(fold_df)), fold_df['accuracy'])
axes[1, 0].axhline(y=0.5, color='red', linestyle='--', label='Random')
axes[1, 0].axhline(y=fold_df['accuracy'].mean(), color='green', linestyle='--', label='Average')
axes[1, 0].set_xlabel('Fold')
axes[1, 0].set_ylabel('Accuracy')
axes[1, 0].set_title('Walk-Forward Accuracy by Fold')
axes[1, 0].legend()

# Monthly returns heatmap
strategy_returns = pd.Series(metrics['strategy_returns'], index=valid_results.index)
monthly = strategy_returns.resample('M').sum()
colors = ['green' if r > 0 else 'red' for r in monthly.values]
axes[1, 1].bar(range(len(monthly)), monthly.values, color=colors, alpha=0.7)
axes[1, 1].set_xlabel('Month')
axes[1, 1].set_ylabel('Monthly Return')
axes[1, 1].set_title('Monthly Returns')

plt.tight_layout()
plt.show()

Part 5: Production System Design

Design a production-ready system architecture.

# Complete production trading system

class ProductionTradingSystem:
    """
    Complete production ML trading system.
    
    The system includes:
    - Feature pipeline with versioning
    - Model registry with version control
    - Real-time prediction service
    - Performance monitoring
    - Alert system
    """
    
    def __init__(self, model, feature_names: List[str],
                 initial_capital: float = 100000):
        self.model = model
        self.feature_names = feature_names
        self.initial_capital = initial_capital
        self.capital = initial_capital
        self.position = 0
        
        self.scaler = StandardScaler()
        self.feature_engineer = FeatureEngineer()
        
        # Tracking
        self.predictions_log = []
        self.trades_log = []
        self.equity_curve = [initial_capital]
        self.alerts = []
        
        # Performance monitoring
        self.rolling_accuracy = deque(maxlen=50)
        
    def fit(self, train_data: pd.DataFrame):
        """Fit the system on training data."""
        # Create features
        df_features = self.feature_engineer.create_all_features(train_data)
        
        # Prepare data
        valid_mask = ~df_features[self.feature_names + ['target']].isna().any(axis=1)
        valid_data = df_features[valid_mask]
        
        X = valid_data[self.feature_names].values
        y = valid_data['target'].values
        
        # Fit scaler and model
        X_scaled = self.scaler.fit_transform(X)
        self.model.fit(X_scaled, y)
        
        print(f"System fitted on {len(X)} samples")
        
    def predict(self, current_data: pd.DataFrame) -> Dict:
        """Generate prediction for current data."""
        try:
            # Create features
            df_features = self.feature_engineer.create_all_features(current_data)
            
            # Get latest features
            X = df_features[self.feature_names].iloc[-1:].values
            
            if np.isnan(X).any():
                return {'status': 'error', 'message': 'NaN in features'}
            
            # Scale and predict
            X_scaled = self.scaler.transform(X)
            prediction = self.model.predict(X_scaled)[0]
            probability = self.model.predict_proba(X_scaled)[0, 1]
            
            signal = 1 if prediction == 1 else -1
            
            result = {
                'status': 'success',
                'prediction': int(prediction),
                'probability': float(probability),
                'signal': signal,
                'timestamp': datetime.now()
            }
            
            # Log prediction
            self.predictions_log.append(result)
            
            return result
            
        except Exception as e:
            return {'status': 'error', 'message': str(e)}
    
    def update_with_actual(self, actual: int):
        """Update system with actual outcome."""
        if self.predictions_log:
            last_pred = self.predictions_log[-1]['prediction']
            is_correct = last_pred == actual
            self.rolling_accuracy.append(1 if is_correct else 0)
            
            # Check for performance degradation
            if len(self.rolling_accuracy) >= 20:
                acc = np.mean(self.rolling_accuracy)
                if acc < 0.45:
                    self.alerts.append({
                        'timestamp': datetime.now(),
                        'type': 'performance_degradation',
                        'message': f'Rolling accuracy dropped to {acc:.2%}'
                    })
    
    def trade(self, signal: int, price: float):
        """Execute trade based on signal."""
        if signal != self.position:
            # Calculate cost
            cost = abs(signal - self.position) * self.capital * 0.001
            self.capital -= cost
            
            trade = {
                'timestamp': datetime.now(),
                'old_position': self.position,
                'new_position': signal,
                'price': price,
                'cost': cost
            }
            self.trades_log.append(trade)
            self.position = signal
    
    def update_pnl(self, price_return: float):
        """Update P&L based on position."""
        pnl = self.capital * self.position * price_return
        self.capital += pnl
        self.equity_curve.append(self.capital)
    
    def get_status(self) -> Dict:
        """Get system status."""
        return {
            'capital': self.capital,
            'position': self.position,
            'total_return': (self.capital / self.initial_capital) - 1,
            'n_predictions': len(self.predictions_log),
            'n_trades': len(self.trades_log),
            'rolling_accuracy': np.mean(self.rolling_accuracy) if self.rolling_accuracy else None,
            'n_alerts': len(self.alerts)
        }

print("ProductionTradingSystem class defined")
# Test production system
# Split data for production simulation
train_data = market_data[primary_asset].iloc[:1500]
test_data = market_data[primary_asset].iloc[1500:]

# Initialize production system
prod_system = ProductionTradingSystem(
    model=RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
    feature_names=top_features,
    initial_capital=100000
)

# Fit on training data
prod_system.fit(train_data)

# Simulate live trading
print("\nSimulating live trading...")
lookback = 100

for i in range(lookback, len(test_data) - 1):
    # Get current data window
    current_data = test_data.iloc[i-lookback:i+1]
    current_price = test_data.iloc[i]['close']
    
    # Generate prediction
    result = prod_system.predict(current_data)
    
    if result['status'] == 'success':
        # Trade
        prod_system.trade(result['signal'], current_price)
        
        # Update P&L
        if i > lookback:
            prev_price = test_data.iloc[i-1]['close']
            price_return = (current_price - prev_price) / prev_price
            prod_system.update_pnl(price_return)
        
        # Update with actual
        next_price = test_data.iloc[i+1]['close']
        actual = 1 if next_price > current_price else 0
        prod_system.update_with_actual(actual)

# Get final status
status = prod_system.get_status()
print("\n" + "="*50)
print("PRODUCTION SYSTEM STATUS")
print("="*50)
for key, value in status.items():
    if 'return' in key or 'accuracy' in key:
        # Use an explicit None check so a legitimate 0.0 is not printed as N/A
        print(f"{key}: {value:.2%}" if value is not None else f"{key}: N/A")
    elif 'capital' in key:
        print(f"{key}: ${value:,.2f}")
    else:
        print(f"{key}: {value}")

Part 6: Final Analysis and Report

Generate comprehensive analysis and final report.

# Generate final comprehensive report

def generate_final_report(backtester: WalkForwardBacktester,
                          prod_system: ProductionTradingSystem,
                          model_comparison: pd.DataFrame) -> str:
    """Generate comprehensive project report."""
    
    bt_metrics = backtester.calculate_backtest_metrics()
    prod_status = prod_system.get_status()
    
    report = f"""
{'='*60}
CAPSTONE PROJECT: END-TO-END ML TRADING SYSTEM
Final Report
{'='*60}
Generated: {datetime.now()}

{'='*60}
1. MODEL SELECTION
{'='*60}
{model_comparison.to_string(index=False)}

Best Model: {model_comparison.iloc[0]['model']}

{'='*60}
2. WALK-FORWARD BACKTEST RESULTS
{'='*60}
Total Return: {bt_metrics['total_return']:.2%}
Sharpe Ratio: {bt_metrics['sharpe_ratio']:.2f}
Max Drawdown: {bt_metrics['max_drawdown']:.2%}
Win Rate: {bt_metrics['win_rate']:.2%}
Number of Trades: {bt_metrics['n_trades']:.0f}
Total Transaction Costs: ${bt_metrics['total_costs']:,.2f}
Average Fold Accuracy: {bt_metrics['avg_fold_accuracy']:.4f}

{'='*60}
3. PRODUCTION SIMULATION RESULTS
{'='*60}
Final Capital: ${prod_status['capital']:,.2f}
Total Return: {prod_status['total_return']:.2%}
Total Predictions: {prod_status['n_predictions']}
Total Trades: {prod_status['n_trades']}
Rolling Accuracy: {f"{prod_status['rolling_accuracy']:.2%}" if prod_status['rolling_accuracy'] is not None else 'N/A'}
Alerts Generated: {prod_status['n_alerts']}

{'='*60}
4. KEY FINDINGS
{'='*60}
- The {model_comparison.iloc[0]['model']} model achieved the best F1 score
- Walk-forward validation shows realistic out-of-sample performance
- Transaction costs significantly impact overall returns
- System includes monitoring for performance degradation

{'='*60}
5. RECOMMENDATIONS
{'='*60}
- Consider additional features (sentiment, alternative data)
- Implement ensemble methods for more robust predictions
- Add regime detection for adaptive model selection
- Monitor for concept drift and retrain periodically

{'='*60}
END OF REPORT
{'='*60}
"""
    return report

# Generate and print report
final_report = generate_final_report(backtester, prod_system, comparison)
print(final_report)
# Final visualization
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# 1. Model comparison
x = range(len(comparison))
axes[0, 0].bar(x, comparison['f1'])
axes[0, 0].set_xticks(x)
axes[0, 0].set_xticklabels(comparison['model'], rotation=45, ha='right')
axes[0, 0].set_ylabel('F1 Score')
axes[0, 0].set_title('Model Comparison')

# 2. Backtest equity curve
axes[0, 1].plot(metrics['cumulative_returns'])
axes[0, 1].set_xlabel('Trading Days')
axes[0, 1].set_ylabel('Cumulative Return')
axes[0, 1].set_title('Backtest Equity Curve')

# 3. Production equity curve
axes[0, 2].plot(prod_system.equity_curve)
axes[0, 2].set_xlabel('Trading Days')
axes[0, 2].set_ylabel('Capital ($)')
axes[0, 2].set_title('Production Simulation')

# 4. Fold accuracy distribution
fold_accuracies = [f['accuracy'] for f in backtester.fold_results]
axes[1, 0].hist(fold_accuracies, bins=20, edgecolor='black')
axes[1, 0].axvline(x=0.5, color='red', linestyle='--')
axes[1, 0].set_xlabel('Accuracy')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Walk-Forward Accuracy Distribution')

# 5. Rolling accuracy over time (production)
if prod_system.rolling_accuracy:
    rolling_acc = list(prod_system.rolling_accuracy)
    cumulative_acc = np.cumsum(rolling_acc) / np.arange(1, len(rolling_acc) + 1)
    axes[1, 1].plot(cumulative_acc)
    axes[1, 1].axhline(y=0.5, color='red', linestyle='--')
    axes[1, 1].set_xlabel('Prediction Number')
    axes[1, 1].set_ylabel('Cumulative Accuracy')
    axes[1, 1].set_title('Production Accuracy Over Time')

# 6. Trade distribution
if prod_system.trades_log:
    positions = [t['new_position'] for t in prod_system.trades_log]
    unique, counts = np.unique(positions, return_counts=True)
    labels = ['Short' if p == -1 else 'Long' for p in unique]
    axes[1, 2].bar(labels, counts)
    axes[1, 2].set_ylabel('Count')
    axes[1, 2].set_title('Trade Distribution')

plt.tight_layout()
plt.show()

print("\n" + "="*60)
print("CAPSTONE PROJECT COMPLETED")
print("="*60)

Project Summary

In this capstone project, you built a complete end-to-end ML trading system that includes:

  1. Data Generation: Created realistic multi-asset market data with regime switching and volatility clustering

  2. Feature Engineering: Built a comprehensive feature pipeline with price, volatility, technical, and volume features

  3. Model Training: Trained and compared multiple ML models (Logistic Regression, Random Forest, Gradient Boosting)

  4. Walk-Forward Backtesting: Implemented proper walk-forward validation with realistic transaction costs

  5. Production System: Designed a production-ready system with prediction, trading, and monitoring capabilities

  6. Performance Analysis: Generated comprehensive analysis and reporting

Key Takeaways

  • Proper feature engineering is crucial for ML trading systems
  • Walk-forward validation provides realistic performance estimates
  • Transaction costs significantly impact strategy returns
  • Production systems require monitoring and alert mechanisms
  • Continuous improvement and adaptation are essential
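The transaction-cost takeaway can be made concrete with a few lines of synthetic data (random returns and signals, not course data), using the same cost model as the backtester, a per-unit charge on each position change:

```python
# Sketch: gross vs net returns under a 10 bp per-position-change commission.
# All inputs are synthetic; only the cost mechanics mirror the backtester.
import numpy as np

rng = np.random.default_rng(42)
returns = rng.normal(0.0005, 0.01, size=252)   # one year of daily returns
position = rng.choice([-1, 1], size=252)       # hypothetical daily signals
commission = 0.001                             # 10 bp per unit of turnover

# |diff| of the position series measures turnover, as in the backtester
changes = np.abs(np.diff(np.concatenate([[0], position])))
gross = position * returns
net = gross - changes * commission

print(f"Gross total return: {(1 + gross).prod() - 1:.2%}")
print(f"Net total return:   {(1 + net).prod() - 1:.2%}")
print(f"Cost drag:          {(gross - net).sum():.2%}")
```

With a signal that flips as often as a coin toss, cost drag alone can exceed the gross edge, which is why the walk-forward results always report returns net of commissions.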

Congratulations!

You have successfully completed the Machine Learning for Financial Markets course. You now have the skills to build, backtest, and deploy ML trading systems.